# Introduction to Python for Statistics

## Code and Text written by: <br/>
## Laura Nelson ~~ *Assistant Professor of Sociology*<br/><br/>Northeastern University
### Prepared for Professor Kilby <br/> January, 2019
<br/>

# Introduction

As social scientists, we now have access to unprecedented amounts of data. This provides great opportunities, but also new challenges. In particular, it is increasingly important to learn a scripting language, such as Python or R, in order to access, collect, and structure data from diverse sources, and to analyze those data using new and developing methods, such as machine learning.

In this lecture and hands-on workshop I will provide a brief introduction to the scripting language Python, using the Jupyter platform (what you're looking at now), with an eye toward data representation and analysis.

I am going to throw a lot of syntax at you. There is no way you will remember all of this. Keep your notes, and my notes, as a reference. Focus on learning the syntax, but also on getting a higher-level understanding of how Python works. If you've never written in Python, all of this may feel very strange to you. The syntax will come, and it gets easier, as you work with it more.

# Learning Goals

- Understand what Python is, why it is useful, and how to use Python for data analysis.
- Understand how Python interacts with, and represents, data. 

# Learning Outcomes

- Learn and be able to explain Python basics - variables, variable types, manipulating variables, and dataframes
- Explain what the Python libraries Pandas and Statsmodels are and what they do
- Write enough code to:
    - Read in a dataset and manipulate a few variables
    - Produce basic summary statistics from a Pandas dataframe
    - Produce three visualizations from the dataframe: histogram, scatter plot, and bar chart
    - Implement a simple OLS regression model and interpret the output

    
# Workshop Outline
1. Python basics
2. The Pandas Dataframe
    - data representation
    - data manipulation
    - summary statistics
3. Data visualization using matplotlib
4. StatsModels for statistics

# 1. Python basics

In [None]:
# python is all about functions, variables, and doing things to variables using functions
# the print function

print("Hello, world!")

<a id='arithmetic'></a>
### Arithmetic

In [None]:
# Computers are really good at arithmetic
# Addition

2+5

In [None]:
# Let's have Python report the results from three operations at the same time

print(2-5)
print(2*5)
print(2/5)

In [None]:
# If we have all of our operations in the last line of the cell, Jupyter will print them together

2-5, 2*5, 2/5

In [None]:
# We can also compare values
# Comparison operators return a Boolean: True or False
print( 5 == 5)
print( 5 == 3)
2>5

## Variable assignment

Assigning variables is something that we do all the time in programming. These aren't quite like the variables from high school algebra, where *x* represents an unknown to solve for. Instead these are like notes to ourselves that we want to save some value(s) for later use.

Note that the equals sign is directional, like an arrow, telling the computer to give a certain value to a certain label.
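
Because assignment runs from right to left, the same label can appear on both sides of the equals sign. The snippet below is a minimal sketch of that idea; the name `count` is made up for illustration and is not used elsewhere in the workshop.

```python
# The right-hand side is evaluated first, then the result is stored under the label
count = 2
count = count + 3   # read the old value (2), add 3, store the result back in 'count'
print(count)
```

In algebra, `count = count + 3` would have no solution. In Python it simply updates the value the label points to.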

In [None]:
# 'a' is being given the value 2; 'b' is given 5

a = 2
b = 5

In [None]:
# Let's perform an operation on the variables

a+b

In [None]:
# Variables can have many different kinds of names

this_number = 2
b/this_number

## Variable types

The type a value has is important. The variable type determines what operators you can use, or how the operators behave.
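
As a small illustration (a sketch, separate from the workshop cells), the same `+` operator adds two integers but joins two strings. The variable names here are made up for the example.

```python
# '+' means arithmetic addition for integers...
int_sum = 17 + 17
# ...but concatenation (joining end to end) for strings
str_sum = '17' + '17'

print(int_sum, type(int_sum))
print(str_sum, type(str_sum))
```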

In [None]:
type('Hello, world!')

In [None]:
type(17)

In [None]:
type(3.2)

In [None]:
type('17')

In [None]:
# Careful: Python reads commas as argument separators, so this does not print one million
print(1,000,000)

In [None]:
type('17')

In [None]:
int('17')

In [None]:
type(int('17'))

In [None]:
float('3.2')

In [None]:
str(17)

## More on operators and variable types

In Python, human language text gets represented as a *string*. These contain sequential sets of characters and they are offset by quotation marks, either double (") or single (').

We will explore different kinds of operations in Python that are specific to human language objects, but it is useful to start by trying to see them as the computer does, as numerical representations.
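
One way to glimpse the numerical representation behind a string is with the built-in functions `ord()` and `chr()`. This is only a sketch to make the idea concrete; the names `codes` and `rebuilt` are made up for the example.

```python
# ord() gives the integer code point behind a single character
codes = [ord(ch) for ch in "Hello"]
print(codes)

# chr() goes the other way, from a code point back to a character
rebuilt = "".join(chr(c) for c in codes)
print(rebuilt)
```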

In [None]:
# The iconic string

print("Hello, World!")

In [None]:
# Assign these strings to variables

a = "Hello"
b = 'World'

In [None]:
# Try out arithmetic operations.
# When we add strings we call it 'concatenation'

print(a+" "+b)
print(a*5)

In [None]:
# Unlike a number that consists of a single value, a string is an ordered
# sequence of characters. We can find out the length of that sequence.

len("Hello, World!")

In [None]:
## Careful about types and operators

In [None]:
x = "string 1"
y = "string 2"
z = 5
w = 3

In [None]:
x + y

In [None]:
z + w

In [None]:
#Why does the following produce an error?
#y + z
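
One way to investigate the question above without stopping the notebook is to catch the error and look at its message. This is a sketch, assuming the same values for `y` and `z` as in the cell above; `message` and `combined` are names made up for the example.

```python
y = "string 2"
z = 5

try:
    y + z   # '+' is not defined between a string and an integer
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# Converting the integer to a string first turns '+' into a concatenation
combined = y + str(z)
print(combined)
```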

In [None]:
##Ex: I have 4 dozen eggs. 
#Write a script that assigns that number to a variable, 
#and prints the total number of eggs I have.

#Ex: Create a variable for hours worked, and another for pay per hour
# Use Python to calculate the total weekly earnings based on these two variables

# 2. The Pandas Dataframe

******************************
The data we'll analyze today comes from:

National Center for Education Statistics, United States Department of Education. (2009). Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K) [Data file]. Available from http://nces.ed.gov/ecls/kindergarten.asp

I selected five variables (columns) to analyze:

* reading_score = READING IRT SCALE SCORE
* math_score = MATH IRT SCALE SCORE
* knowledge_score = GENERAL KNOWLEDGE IRT SCALE SCORE
* p2income = TOTAL HOUSEHOLD INCOME
* incomecat = INCOME CATEGORIES
    * 1 = low income: < \$40,000
    * 2 = mid income: >= \$40,000 and < \$70,000
    * 3 = high income: >= \$70,000
    
The unit of observation (row) is the individual kindergartner.  
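
To make the income categories concrete, here is a sketch of how a column like `incomecat` could be derived from `p2income` with `pandas.cut`. The incomes are made up, and the cut points assume that the middle category covers everything from \$40,000 up to, but not including, \$70,000. This is not necessarily how the categories in the dataset were built.

```python
import pandas as pd

# Made-up incomes for illustration only; these are not ECLS-K values
toy = pd.DataFrame({"p2income": [25000, 40000, 69999, 70000, 120000]})

# Assumed cut points: 1 = below 40,000; 2 = 40,000 up to 70,000; 3 = 70,000 and above
toy["incomecat"] = pd.cut(
    toy["p2income"],
    bins=[float("-inf"), 40000, 70000, float("inf")],
    labels=[1, 2, 3],
    right=False,   # intervals are closed on the left, e.g. [40000, 70000)
).astype(int)

print(toy)
```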
   
## Motivating Questions

1. Are math, reading, and general knowledge scores related to household income in any predictable way?
2. Can you predict general knowledge scores from reading or math scores? That is, are reading and math skills related to general knowledge?

In [None]:
#import our library
import pandas

In [None]:
df = pandas.read_csv("education_dataset.csv", sep=',')
df.head()

## Dataframe slicing

In [None]:
#syntax to extract columns
df['reading_score'].head()

In [None]:
## Ex: Extract the knowledge score column
df['knowledge_score'].head()

In [None]:
#extract one row: notice the syntax
df.loc[0]
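
A note on the syntax: `.loc` selects rows by index *label*, while `.iloc` selects by position. With the default index (0, 1, 2, ...) the two coincide, which is why `df.loc[0]` returns the first row. The sketch below uses a made-up toy dataframe whose labels deliberately start at 10 to show the difference.

```python
import pandas as pd

# Toy scores (made up); the index labels are deliberately not 0, 1, 2
toy = pd.DataFrame(
    {"reading_score": [30.0, 45.5, 51.2]},
    index=[10, 11, 12],
)

by_label = toy.loc[10]      # the row whose index label is 10
by_position = toy.iloc[0]   # the first row, whatever its label is

print(by_label["reading_score"], by_position["reading_score"])
```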

## Summary Statistics

In [None]:
## Summary statistics
df['reading_score'].mean()

In [None]:
df['reading_score'].sum()

In [None]:
df['reading_score'].std()

In [None]:
##Ex: explore one of the other columns

In [None]:
#We can find it all at the same time

df.describe()

## Differences between means 
What if we want to know if the mean is different across categories?

In [None]:
#advanced: Pandas groupby function
#create a groupby object that groups the rows by income category

df_grouped = df.groupby('incomecat')
df_grouped

In [None]:
df_grouped['reading_score']

In [None]:
df_grouped['reading_score'].mean()
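
To see what the grouped mean is doing, it can help to try it on a dataset small enough to check by hand. The following is a sketch with made-up scores; only the column names match our dataset.

```python
import pandas as pd

# Made-up data: two students in each income category
toy = pd.DataFrame({
    "incomecat": [1, 1, 2, 2, 3, 3],
    "reading_score": [20.0, 30.0, 35.0, 45.0, 50.0, 60.0],
})

# Split the rows by category, then average reading_score within each group
group_means = toy.groupby("incomecat")["reading_score"].mean()
print(group_means)
```

The result is a Series indexed by income category, with one mean per group.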

In [None]:
#Ex: explore other categories based on income group

# 3. Visualization

We'll use another library for this: matplotlib

In [None]:
import matplotlib.pyplot as plt

In [None]:
df.hist()
plt.show()

In [None]:
#That's not pretty. Let's show just one
#Ex: how might you do this, knowing what you know about how to pull out a column

df['knowledge_score'].hist()
plt.show()

In [None]:
#Ex: explore your column of choice here

In [None]:
#Other options:
#Scatter plot: are math and reading scores correlated?

df.plot(kind='scatter', x = 'reading_score', y = 'math_score')
plt.show()

In [None]:
#Ex: based on visuals alone, is there a stronger relationship between math score and general knowledge,
#Or reading score and general knowledge?
df.plot(kind='scatter', x = 'reading_score', y = 'knowledge_score')
plt.show()

In [None]:
df.plot(kind='scatter', x = 'math_score', y = 'knowledge_score')
plt.show()

In [None]:
## Plot average by income
## remember our grouped dataframe (df_grouped)
## Let's first make another dataframe from it

df_grouped_mean = df_grouped.mean()
df_grouped_mean

In [None]:
## We can plot this like we would the original dataframe!

df_grouped_mean.plot(kind='bar')
plt.show()

In [None]:
df_grouped_mean[['reading_score', 'math_score', 'knowledge_score']].plot(kind='bar')
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol = 3)
plt.show()

# 4. Statistics using StatsModels

In [None]:
#We'll use two parts of the statsmodels library, the ttest and ols
import statsmodels.api as sm
from statsmodels.formula.api import ols

## T-test

In [None]:
## Task one: is the difference between categories statistically significant?
#First, one more Pandas skill: data slicing

#pull out a column
df['incomecat'].head()

In [None]:
#test if a row meets a certain condition (remember our boolean variables from above)
(df['incomecat']==1).head()

In [None]:
#pull out only rows that meet that condition

df[df['incomecat']==1].head()

In [None]:
#Create three new dataframes, for each category
df_cat1 = df[df['incomecat']==1]
df_cat2 = df[df['incomecat']==2]
df_cat3 = df[df['incomecat']==3]
df_cat1.head()

In [None]:
sm.stats.ttest_ind(df_cat1['math_score'], df_cat3['math_score'])
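
The call above returns three numbers: the t statistic, the p-value, and the degrees of freedom. To give a sense of what the t statistic measures, here is a hand-rolled sketch of a pooled-variance two-sample t statistic using only the standard library. The two groups are made-up numbers, not scores from our dataset, and the sketch leaves out the p-value.

```python
from math import sqrt
from statistics import mean, variance

# Two made-up groups of scores, for illustration only
group_a = [10.0, 12.0, 14.0]
group_b = [13.0, 15.0, 17.0]
n_a, n_b = len(group_a), len(group_b)

# Pooled variance: a weighted average of the two sample variances
pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)

# t statistic: the difference in means, scaled by its standard error
t_stat = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
dof = n_a + n_b - 2   # degrees of freedom

print(t_stat, dof)
```

The further the t statistic is from zero, the larger the difference between the group means relative to the spread within the groups.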

In [None]:
##Ex: produce a t-test for a variable of your choice

## OLS regression

In [None]:
##Is there a relationship between reading score, math score, and general knowledge?
## Which of these relationships is stronger?
# Is there a relationship controlling for income group?

In [None]:
model = ols("knowledge_score ~ incomecat", df).fit()

In [None]:
print(model.summary()) 
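
When reading the summary table, the `coef` column holds the estimated intercept and slope. As a sketch of where those numbers come from in the one-predictor case, the least-squares formulas can be worked by hand on made-up data. The `x` and `y` values below are invented so that the answer is easy to check; they are not from our dataset.

```python
from statistics import mean

# Made-up data that lie exactly on the line y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

x_bar, y_bar = mean(x), mean(y)

# Least-squares slope: how x and y vary together, divided by how much x varies
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)

# The fitted line passes through the point of means
intercept = y_bar - slope * x_bar

print(intercept, slope)
```

The slope is read as the expected change in the outcome for a one-unit increase in the predictor.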

In [None]:
##Ex: Explore the relationship between reading score and knowledge score.
## Do the same for other dependent/independent variables.