![DSL_logo](https://github.com/BrockDSL/Intro_to_Python_Workshop/blob/master/dsl_logo.png?raw=1) 

![PythonPsycology](http://www.pygaze.org/wp-content/uploads/2016/10/Dalmaijer_2016_PEP-1024x356.jpg) 

# Data Science with Python!

Welcome, Brock Psychology Society, to the Digital Scholarship Lab Level 2 Python workshop. In yesterday's Python workshop, we covered the following:

- variables
- math
- conditionals
- loops
- functions


What we'll learn today is:
- importing the Pandas and NumPy libraries
- analyzing data with Pandas, Matplotlib, and Seaborn


We'll be using Python as a data analysis tool.

Before we get going, the next cell should look totally familiar to you.

In [None]:
scores = [3,5,6,2,1,6]

def find_mean(scores):
    
    # use 'total' rather than 'sum' so Python's built-in sum() is not overwritten
    total = 0
    for s in scores:
        total = total + s
        
    return total/len(scores)


print(find_mean(scores))

----

## Importing Libraries

- Our end goal is to re-use as much code as possible
- To do this we load in different Libraries using the `import` command
- For this example we want to load in the [statistics](https://docs.python.org/3/library/statistics.html) library

In [None]:
import statistics

print(statistics.mean(scores))
print(statistics.median(scores))
print(statistics.mode(scores))
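The `import` command can be written in a few different ways, and the alias form shows up again later with Pandas and NumPy. The sketch below is only an illustration: the name `example_scores` and the alias `st` are made up for this example.

```python
import statistics                # whole library: call functions as statistics.name()
import statistics as st          # same library under a shorter alias
from statistics import median    # bring a single function in by name

# Hypothetical values, used only for illustration
example_scores = [3, 5, 6, 2, 1, 6]

print(statistics.mean(example_scores))  # full library name
print(st.mean(example_scores))          # alias, same result
print(median(example_scores))           # imported function, no prefix needed
```

All three styles reach the same code; the alias form simply saves typing.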

- Try Q1 - Q2 below and type "Got it" in the chat when you are done.

- **Q1** We can use the [math](https://docs.python.org/3/library/math.html) library to do interesting calculations, but we need to import it first. E.g., the function used to find the square root of a number is called `math.sqrt()`. Modify the following code to print out the square root of the variable `number`.

In [None]:
import math

number = 81

print(number)

The `str` type is so important that it is built into Python, so its commands are always available without an import.

- **Q2**  Play around with printing the contents of the variable `all_caps` using different capitalization commands, as described in the cell's comments. (Details on the [str](https://docs.python.org/3/library/string.html) library, if you're interested)

In [None]:
all_caps = "HELLO PYTHON USER"

# add .lower() to the following line so that the variable represented by all_caps prints in all lowercase
print(all_caps)
# add .title() to the following line to capitalize the first letter of each word, and the rest lowercase
print(all_caps)
# add .capitalize() to the following line to capitalize only the first letter of the sentence, and the rest lowercase
print(all_caps)

# EXERCISE: Analyzing scores from a Big 5 Personality Dataset

We'll be focusing on data analysis for the rest of this workshop so let's import some libraries: [pandas](https://pandas.pydata.org/) and [numpy](https://pythonistaplanet.com/numpy/)

We will be analyzing a dataset of about 315 Big 5 Personality test scores.

You could do some of this analysis in Excel, true, but a large dataset quickly becomes difficult to work with there.

![excel preview](https://raw.githubusercontent.com/BrockDSL/BrockPsych_Python_Collaboration_2021/master/psycscreenshot.PNG) 


There are 315 rows of Big 5 personality score data; this is just a sample view.

The personality dataset has 8 columns:

- Gender
- Age
- Openness
- Neuroticism
- Conscientiousness
- Agreeableness
- Extraversion
- Personality

The rows represent different people who took the Big 5 test.

The Personality column is the one we are interested in, and we want to analyze the data to see how the other columns (i.e. the personality traits) affect the overall personality.

## Loading the Python Libraries

To get Python and the notebook ready we need to run the following cell.

NumPy is a library that allows you to:

- Use mathematical operations on arrays and matrices (similar to the lists we saw yesterday)

Pandas is a library used with NumPy that allows you to:

- Create dataframes: a data structure consisting of rows and columns
- Apply functions and other analytical operations to that dataframe

In [None]:
#Load the Library Pandas, that works with data
import pandas as pd

#Load the Library Numpy, that works with numerical calculations
import numpy as np

#These two libraries are often used together!
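To make the two descriptions above concrete, here is a small sketch. The values and the names `scores_array`, `doubled`, and `toy_df` are hypothetical and are not part of the workshop dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, not taken from the workshop dataset
scores_array = np.array([3, 5, 6, 2, 1, 6])

# NumPy applies the operation to every element at once, no loop needed
doubled = scores_array * 2
print(doubled)
print(scores_array.mean())

# A dataframe is a table: each dictionary key becomes a column
toy_df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Score": [3, 5, 6],
})
print(toy_df)

# Selecting one column and applying a function to it
print(toy_df["Score"].mean())
```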

In [None]:
#Load the CSV file into a dataframe
df = pd.read_csv("https://brockdsl.github.io/BrockPsych_Python_Collaboration_2021/psyc.csv")

#Set the column names so we can refer to the columns by name
df.columns = ["gender","age","openness","neuroticism","conscientiousness","agreeableness","extraversion","Personality"] 

#Capitalize the first letter of each column name, e.g. "gender" becomes "Gender"
df.columns = df.columns.str.title()

#The rest of the notebook uses both names, df and data, for this dataframe
data = df

df.head(10)

Pandas can give us some useful quantitative details about our data when we call the `describe()` function.

In [None]:
data.describe()
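The table that `describe()` returns is itself a dataframe, so single values can be picked out of it. A minimal sketch with a made-up dataframe (`toy_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data, used only for illustration
toy_df = pd.DataFrame({
    "Age": [18, 20, 22, 24],
    "Openness": [4, 6, 5, 7],
})

# One row per statistic (count, mean, std, min, quartiles, max),
# one column per numeric column of the dataframe
summary = toy_df.describe()
print(summary)

# Pick out one statistic for one column with .loc[row, column]
print(summary.loc["mean", "Age"])
```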

## Grouping and  Counting

- We also need to gather the entries we need by grouping them together with the `.groupby()` function. We can chain these things together to ask very specific questions of the data.
- We pass the column we'd like to group the data by
- We add `.count()` if we are just interested in the counts and not the dataframe

Group the people in the dataset by the personality label they were assigned

In [None]:
data.groupby('Personality')

In [None]:
data.groupby("Personality").count()
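The first cell above only prints a description of a grouping object, because `.groupby()` on its own does not calculate anything yet. The sketch below, on a made-up dataframe, shows that and also contrasts `.count()` with `.size()`, which is not used elsewhere in this workshop and is included here only as an aside.

```python
import pandas as pd

# Hypothetical data with one missing Age, used only for illustration
toy_df = pd.DataFrame({
    "Personality": ["serious", "lively", "serious", "lively", "serious"],
    "Age": [19, 21, 20, None, 23],
})

grouped = toy_df.groupby("Personality")
print(type(grouped))   # a GroupBy object: the groups are defined, nothing is computed yet

# .count() gives the number of non-missing values per column in each group
counts = grouped.count()
print(counts)

# .size() gives the number of rows in each group, missing values included
sizes = grouped.size()
print(sizes)
```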

Try questions Q4 & Q5 below and type "Finished!" in the chat box when you are done

**Q4** How many people are female in this dataset?

**Q5** How many different ages are there in the dataset?

## Grouping and applying functions

- If we want to do some math on the data we need to cluster it together a bit. We use `.groupby()` and then apply our mathematical functions to the result
- Here we'll use the following 3 functions:
 - `mean()` finds the arithmetic mean of the data
 - `max()` finds the largest value in that column
 - `min()` finds the smallest value in that column

What is the average Extraversion score of people with each personality label, such as 'extraverted'?

In [None]:
data.groupby("Personality")["Extraversion"].mean()
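The same pattern works for all three functions. Here is a small sketch on a made-up dataframe, where `toy_df` and its values are hypothetical:

```python
import pandas as pd

# Hypothetical data, used only for illustration
toy_df = pd.DataFrame({
    "Personality": ["serious", "lively", "serious", "lively"],
    "Extraversion": [2, 7, 4, 5],
})

# Group once, pick the column, then apply each function to the groups
by_label = toy_df.groupby("Personality")["Extraversion"]

means = by_label.mean()
largest = by_label.max()
smallest = by_label.min()

print(means)
print(largest)
print(smallest)
```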

Try questions Q6-Q8 and type "All done" into the chat when you are finished

- **Q6** What is the average Age of people of each `Neuroticism` score?

- **Q7** What is the minimum and maximum Age seen in the data?

- **Q8** What is the minimum and maximum Conscientiousness score seen for each personality?

# Sorting & Multi line commands 

- We can apply sorting to our dataframe actions by using the function `.sort_values()` 

- We need to say which column we'd like to sort on with `by =` 

- We also need to tell it to display the results in decreasing order (largest first) with `ascending = False` 

What Agreeableness score has the most people assigned to it? Here we do it in two steps 

In [None]:
by_Agreeableness = data.groupby("Agreeableness").count() 

sorted_Agreeableness = by_Agreeableness.sort_values(by = "Personality",ascending = False) 

sorted_Agreeableness

We could also do it in one step: 

In [None]:
data.groupby("Agreeableness").count().sort_values(by = "Personality",ascending = False) 
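To see what `ascending` changes, here is a minimal sketch on a made-up dataframe (`toy_df` and the names in it are hypothetical):

```python
import pandas as pd

# Hypothetical data, used only for illustration
toy_df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Agreeableness": [5, 7, 3],
})

# Default order is ascending: smallest value first
low_to_high = toy_df.sort_values(by="Agreeableness")
print(low_to_high)

# ascending=False flips the order: largest value first
high_to_low = toy_df.sort_values(by="Agreeableness", ascending=False)
print(high_to_low)
```

Sorting returns a new dataframe; `toy_df` itself keeps its original order.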

## Unique entries & values counts 

- Here we use `.unique()` to list each distinct value once, in the order it first appears. Results are returned as a NumPy array, which behaves much like a list and is useful for us later 

- This is useful for seeing how many values are in a categorical column 

In [None]:
data["Openness"].unique() 

What are the unique values for the Age field? 

In [None]:
data["Age"].unique() 

- To get each unique value together with how often it occurs in the data we use `value_counts()`

In [None]:
data["Age"].value_counts() 
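The difference between these commands is easier to see on a short column. In the sketch below the values in `ages` are made up, and `.nunique()` is an extra Pandas function not covered elsewhere in this workshop.

```python
import pandas as pd

# Hypothetical ages, used only for illustration
ages = pd.Series([18, 20, 18, 22, 20, 18], name="Age")

# Each distinct value once, in order of first appearance
distinct = ages.unique()
print(distinct)

# Just the number of distinct values
how_many = ages.nunique()
print(how_many)

# Each distinct value with the number of times it occurs, most frequent first
frequencies = ages.value_counts()
print(frequencies)
```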

## Selecting subsets of data 

- To make life easier we can create dataframes that just have the values we are interested in 

- This is a bit more complicated but follows this type of pattern: 

``` 

dataframe[dataframe[search criteria]] 

``` 

- We are basically creating a subset of the dataframe by matching all entries that match `search criteria` 

- That search criteria can be anything that is a conditional 

- Doing this gives you a new dataframe 

E.g. A new dataframe of people with an Extraversion score over 6

In [None]:
over_6 = df[df["Extraversion"] > 6] 

over_6.head(10)

E.g. If we want the count of people with an Extraversion score over 6, we apply the `.count()` function to what we selected 

In [None]:
over_6.count() 

This can be done in 1 line as well 

In [None]:
data[data["Extraversion"] > 6].count() 
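It can help to look at the search criteria on their own before using them. The sketch below uses a made-up dataframe; the `"Female"` and `"Male"` labels are assumptions for illustration and may not match how gender is coded in the workshop dataset.

```python
import pandas as pd

# Hypothetical data, used only for illustration
toy_df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Extraversion": [7, 4, 5, 8],
})

# The search criteria on their own give one True/False value per row
mask = toy_df["Extraversion"] > 6
print(mask)

# Putting the mask inside the square brackets keeps only the True rows
over_6 = toy_df[mask]
print(over_6)

# Two criteria can be combined with & (and); each one needs its own parentheses
both = toy_df[(toy_df["Extraversion"] > 6) & (toy_df["Gender"] == "Female")]
print(both)
```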

Try Q9-Q10 below and type "I got it" into the chat when you are done 

- **Q9** Can you make a new dataframe that just has people with a `Conscientiousness` score of 9 in it? Display the first 5 entries.

In [None]:
Concientious9_People = #FILL
Concientious9_People.head()

- **Q10** Can you 'describe' the newly created dataframe, to get some basic information on the columns in the dataframe?

# Some questions now

Let's first make a dataframe of all of the serious people

In [None]:
serious_people = data[data["Personality"] == "serious"]
serious_people

Try answering Q11 - Q14, type "Finished" into the chat when you are done

- **Q11** How can we sort our `serious_people` dataframe?

- **Q12** What is the average Openness score of those in the `serious_people` dataset?

- **Q13** What is the max Age of those in the `serious_people` dataset?

- **Q14** What percentage of people in the `serious_people` dataset have a neuroticism score greater than 5? (This is probably the most complex question of the day, feel free to take as much time as you need to answer it)

# Another Library, Matplotlib

Let's take a look at graphing our results. We can use the `matplotlib` library to generate some graphs of our results. We always give lists as parameters for the graphs.

In [None]:
#This line is for Jupyter's benefit
%matplotlib inline
#Import Matplotlib to graph some results
import matplotlib.pyplot as plt

In [None]:
#Load the file
graph_data = pd.read_csv("https://brockdsl.github.io/BrockPsych_Python_Collaboration_2021/psyc.csv")

graph_data.columns = ["gender","age","openness","neuroticism","conscientiousness","agreeableness","extraversion","personality"] 

## Donut Charts
Let's draw a donut chart of the number of people of each personality label.

In [None]:
#All of the dependable people
Total_dependable = graph_data[graph_data["personality"] == "dependable"]["personality"].count()
print("Dependable People: " + str(Total_dependable))

#All of the serious people
Total_serious = graph_data[graph_data['personality'] == "serious"]["personality"].count()
print("Serious People: "+ str(Total_serious))

#All of the extraverted people
Total_extraverts = graph_data[graph_data["personality"] == "extraverted"]["personality"].count()
print("Extraverts: "+ str(Total_extraverts))

#All of the lively people
Total_lively = graph_data[graph_data["personality"] == "lively"]["personality"].count()
print("Lively People: "+ str(Total_lively))

#All of the responsible people
Total_responsible = graph_data[graph_data["personality"] == "responsible"]["personality"].count()
print("Responsible People: "+ str(Total_responsible))

# Matplotlib wants the data in a list, so we'll make one
pie_data = [Total_dependable,Total_serious,Total_extraverts,Total_lively,Total_responsible]
pie_labels = ["Dependable People", "Serious People", "Extraverts", "Lively People", "Responsible People"]
plt.pie(pie_data,labels=pie_labels, colors = ("red", "pink", "blue", "cyan", "purple"))

# Add a circle to create a hole in the pie chart
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)


plt.show()
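The cell above counts each personality label with its own filter. As an alternative sketch, `value_counts()` from earlier can produce all the counts in one step. The labels in `labels` below are made up, and this is one possible shortcut rather than the way the workshop cell does it.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical personality labels, used only for illustration
labels = pd.Series(["serious", "lively", "serious", "dependable", "lively", "serious"])

# One count per label, most frequent first
counts = labels.value_counts()
print(counts)

# Convert to plain lists for Matplotlib
pie_data = counts.tolist()
pie_labels = counts.index.tolist()

fig, ax = plt.subplots()
ax.pie(pie_data, labels=pie_labels)

# Add a white circle in the middle to turn the pie into a donut
ax.add_artist(plt.Circle((0, 0), 0.70, fc="white"))

plt.show()
```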

## Histograms

Say we wanted to plot the personality distribution of our data

In [None]:
# bins is the number of containers we'll split our x-axis values into
bins = 5

plt.hist(graph_data["personality"],bins, color=('red'), alpha=(0.9), hatch="x", edgecolor='white')

plt.title("Personality Distribution", color=(0.2,0.6,0.4,0.6), size=30)
plt.xlabel("Personality", size=20)
plt.ylabel("Occurrences", size=20)

#Set Background colour
plt.gca().set_facecolor('lightblue')
plt.gca().set_axis_on()

#Change the color of the x and y values
ax = plt.gca()
ax.tick_params(axis='x', colors='brown')
ax.tick_params(axis='y', colors='blue')

plt.show()
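For numeric data, `bins` decides how the range of values is cut into equal-width containers. The sketch below uses NumPy's `np.histogram` to show the counting step without drawing anything. The values in `example_scores` are made up.

```python
import numpy as np

# Hypothetical scores, used only for illustration
example_scores = [1, 2, 2, 3, 5, 6, 6, 7, 9, 10]

# Split the range 1 to 10 into 3 equal-width bins and count the values in each
counts, edges = np.histogram(example_scores, bins=3)

print(edges)    # the boundaries of the bins
print(counts)   # how many values fall into each bin
```

With 3 bins the boundaries fall at 1, 4, 7 and 10, so the counts are 4, 3 and 3. Changing `bins` changes both the boundaries and the counts, which is why the same data can look different in two histograms.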

**Q15** Can you create a pie graph that shows the gender distribution in the data? You just need to modify lines 2 & 6

In [None]:
#Fill in the following
Females = graph_data[graph_data["ChangeMe"] == "ChangeMe"]["gender"].count()
print("Females: "+ str(Females))

#Fill in the following
Males = graph_data[graph_data["ChangeMe"] == "ChangeMe" ]["gender"].count()
print("Males: "+ str(Males))

pie_data = [Females,Males]
pie_labels = ["Females","Males"]
plt.pie(pie_data,labels=pie_labels)

plt.show()

**Q16** Can you draw a histogram of the Neuroticism distribution? Make sure to give the axes good descriptions. You just need to modify lines 1, 5, & 6. (The example above should help you)

In [None]:
bins = #FILL

plt.hist(graph_data["neuroticism"],bins) 
plt.title("Neuroticism Distribution")
plt.xlabel("ChangeMe") #FILL
plt.ylabel("ChangeMe") #FILL

plt.show()


## Another Library: Seaborn
Seaborn can do the same charts as Matplotlib, along with charts that visualize correlations between variables.

Install Seaborn

In [None]:
%pip install seaborn

Import the Seaborn library, and ensure that NumPy, Matplotlib, and Pandas are also imported

In [None]:
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd 
import seaborn as sns

Create a violin plot to compare the distribution of one variable across the groups of another

In [None]:
#Create a violin chart (recent Seaborn versions need x and y passed by name)
sns.violinplot(x="Gender", y="Openness", data=df,
               palette=["hotpink", "cyan"])

#Create a title
plt.title('Comparing the Distribution of Openness for Females and Males')
plt.show()

**Q17** Create a violin plot to compare the distribution between openness and age

In [None]:
#Create a violin chart


# Congrats!

You now know a bit about Python libraries and some data analysis tools.