# Introduction to Python for Statistics

## Code and Text written by: <br/>
## Laura Nelson ~~ *Assistant Professor of Sociology*<br/><br/>Northeastern University
### Prepared for Professor Kilby <br/> January, 2019
<br/>

# Introduction

As social scientists, we now have access to unprecedented amounts of data. This provides great opportunity, but also different challenges. In particular, it is increasingly important to learn a scripting language, such as Python or R, in order to access, collect, and structure data from diverse sources, and analyze data using new and developing methods, such as machine learning.

In this lecture and hands-on workshop I will provide a brief introduction to the scripting language Python, using the Jupyter platform (what you're looking at now), with an eye toward data representation and analysis.

I am going to throw a lot of syntax at you, and there is no way you will remember all of it. Keep your notes, and mine, as a reference. Focus on learning the syntax, but also on building a higher-level understanding of how Python works. If you've never written Python, all of this may feel very strange at first. It gets easier the more you work with it.

# Learning Goals

- Understand what Python is, why it is useful, and how to use Python for data analysis.
- Understand how Python interacts with, and represents, data. 

# Learning Outcomes

- Learn and be able to explain Python basics - variables, variable types, manipulating variables, and dataframes
- Explain what the Python libraries Pandas and Statsmodels are and what they do
- Write enough code to:
    - Read in a dataset and manipulate a few variables
    - Produce basic summary statistics from a Pandas dataframe
    - Produce three visualizations from the dataframe: histogram, scatter plot, and bar chart
    - Implement a simple OLS regression model and interpret the output

    
# Workshop Outline
1. Python basics
2. The Pandas Dataframe
    - data representation
    - data manipulation
    - summary statistics
3. Data visualization using matplotlib
4. StatsModels for statistics

# 1. Python basics

In [1]:
# Python is all about functions, variables, and doing things to variables

print("Hello, world!")

Hello, world!


<a id='arithmetic'></a>
### Arithmetic

In [2]:
#computers are great at arithmetic

2+5

7

In [4]:
(2-5)
print(2*5)
2/5

10


0.4

In [6]:
2-5, 2*5, 2/5
(2+5) * 3

21

In [10]:
print(5==5)
print(5==3)
#2>5
print(2!=3)

True
False
True


## Variable assignment

Assigning variables is something that we do all the time in programming. These aren't quite like the variables from high school algebra, where *x* represents an unknown to solve for. Instead these are like notes to ourselves that we want to save some value(s) for later use.

Note that the equals sign is directional, like an arrow, telling the computer to give a certain value to a certain label.

In [11]:
a = 2
b = 5

In [12]:
a+b

7

In [14]:
this_number = 2
b/this_number

2.5
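
To make the "arrow" idea concrete, here is a minimal sketch of reassignment. The names `count` and `total` are made up for illustration. The right-hand side is evaluated first, and the result is then stored under the label on the left.

```python
# Assignment evaluates the right-hand side first, then stores the result
# under the name on the left.
count = 2
count = count + 3   # read the old value (2), add 3, store 5 back in count
total = count * 2   # total is computed once, from the current value of count
count = 100         # changing count later does not go back and change total

print(count)  # 100
print(total)  # 10
```

This is why `count = count + 3` makes sense in Python even though it would be a contradiction in algebra.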

## Variable types

The type a value has is important. The variable type determines what operators you can use, or how the operators behave.

In [15]:
type('Hello, world!')

str

In [16]:
type(17)

int

In [17]:
type(3.2)

float

In [18]:
type('17')

str

In [21]:
type(int('17'))

int

In [22]:
str(3.2)

'3.2'

In [24]:
int(7.8)

7

In [25]:
float(7)

7.0

In [26]:
int(0.044)

0

In [29]:
print(17+2)
print('17' + 2)

19


TypeError: must be str, not int
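
The error above comes from mixing a string and an integer. A small sketch of the two ways out, depending on what you meant (the variable names are illustrative; the exact wording of the error message also differs between Python versions):

```python
text_number = '17'

# If you meant arithmetic, convert the string to a number first.
as_sum = int(text_number) + 2

# If you meant to glue text together, convert the number to a string first.
as_text = text_number + str(2)

print(as_sum)   # 19
print(as_text)  # 172
```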

## More on operators and variable types

In Python, human language text is represented as a *string*. A string is a sequence of characters set off by quotation marks, either double (") or single (').

We will explore different kinds of operations in Python that are specific to human language objects, but it is useful to start by trying to see them as the computer does, as numerical representations.
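
As a rough illustration of the "numerical representation" idea, the built-in functions `ord` and `chr` convert between a character and its Unicode code point. This is only a sketch of one way to peek under the hood. The variable names are made up.

```python
word = "Hello"

# Each character corresponds to an integer code point.
code_points = [ord(ch) for ch in word]
print(code_points)  # [72, 101, 108, 108, 111]

# chr() goes the other way, so the original string can be rebuilt.
rebuilt = "".join(chr(n) for n in code_points)
print(rebuilt)  # Hello

# Because characters are numbers underneath, case matters when comparing.
print("hello" == "Hello")  # False
```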

In [31]:
a = "Hello"
b = "world"

In [32]:
a+b

'Helloworld'

In [33]:
a*5

'HelloHelloHelloHelloHello'

In [39]:
##Ex: I have 7 dozen eggs. 
#Write a script that assigns that number to a variable, 
#and prints the total number of eggs I have.

num_dozens = 7
print(num_dozens * 12)


#Ex: Create a variable for hours worked, and another for pay per hour
# Use Python to calculate the total weekly earnings based on these two variables

# hours must be a number: the string '35' times 256 would repeat the text '35' 256 times
hours = 35
pay = 256

print(hours*pay)

84
8960


# 2. The Pandas Dataframe

******************************
The data we'll analyze today comes from:

National Center for Education Statistics, United States Department of Education. (2009). Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K) [Data file]. Available from http://nces.ed.gov/ecls/kindergarten.asp

I selected five variables (columns) to analyze:

* reading_score = READING IRT SCALE SCORE
* math_score = MATH IRT SCALE SCORE
* knowledge_score = GENERAL KNOWLEDGE IRT SCALE SCORE
* p2income = TOTAL HOUSEHOLD INCOME
* incomecat = INCOME CATEGORIES
    * 1 = low income: < \$40,000
    * 2 = mid income: >= \$40,000 and < \$70,000
    * 3 = high income: >= \$70,000
    
The unit of observation (row) is the individual kindergartner.  
   
## Motivating Questions

1. Are math, reading, and general knowledge scores related to household income in any predictable way?
2. Can you predict general knowledge scores from reading or math scores? That is, are reading and math skills related to general knowledge?

In [43]:
import pandas

In [49]:
df = pandas.read_csv('education_dataset.csv', sep = ',')
df

| | reading_score | math_score | knowledge_score | p2income | incomecat |
|---|---|---|---|---|---|
| 0 | 36.58 | 39.54 | 33.822 | 140000.0 | 3 |
| 1 | 50.82 | 44.44 | 38.147 | 120000.0 | 3 |
| 2 | 40.68 | 28.57 | 28.108 | 90000.0 | 3 |
| 3 | 32.57 | 23.57 | 15.404 | 50000.0 | 2 |
| 4 | 31.98 | 19.65 | 18.727 | 55000.0 | 2 |
| 5 | 50.45 | 36.27 | 33.352 | 150000.0 | 3 |
| 6 | 32.49 | 20.82 | 26.211 | 42000.0 | 2 |
| 7 | 33.30 | 26.85 | 27.072 | 70000.0 | 3 |
| 8 | 65.92 | 47.36 | 33.514 | 100000.0 | 3 |
| 9 | 34.20 | 22.27 | 28.096 | 78000.0 | 3 |
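
If you do not have `education_dataset.csv` at hand, the sketch below shows the same `read_csv` call on a few lines of CSV text held in memory. The three data rows are copied from the output above. The names `csv_text` and `sample` are made up for illustration, and `io.StringIO` simply stands in for a file.

```python
import io
import pandas

# A tiny stand-in for the real file: a header line plus three rows.
csv_text = """reading_score,math_score,knowledge_score,p2income,incomecat
36.58,39.54,33.822,140000.0,3
50.82,44.44,38.147,120000.0,3
40.68,28.57,28.108,90000.0,3
"""

# read_csv accepts a file path or any file-like object.
sample = pandas.read_csv(io.StringIO(csv_text), sep=',')

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # the type pandas inferred for each column
print(sample.head(2))  # the first two rows
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the file was parsed the way you expected.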


## Dataframe slicing

In [51]:
df[['reading_score', 'math_score']]

| | reading_score | math_score |
|---|---|---|
| 0 | 36.58 | 39.54 |
| 1 | 50.82 | 44.44 |
| 2 | 40.68 | 28.57 |
| 3 | 32.57 | 23.57 |
| 4 | 31.98 | 19.65 |
| 5 | 50.45 | 36.27 |
| 6 | 32.49 | 20.82 |
| 7 | 33.30 | 26.85 |
| 8 | 65.92 | 47.36 |
| 9 | 34.20 | 22.27 |


In [52]:
#EX: extract the knowledge score column
df['knowledge_score']

0        33.822
1        38.147
2        28.108
3        15.404
4        18.727
5        33.352
6        26.211
7        27.072
8        33.514
9        28.096
10       11.269
11       24.710
12       21.195
13       14.700
14       23.977
15       18.558
16       20.702
17       27.976
18       33.879
19       25.621
20       23.991
21       14.994
22       27.330
23       34.184
24       13.810
25       11.294
26       20.634
27       14.549
28        9.223
29        8.074
          ...  
11903    15.665
11904    19.347
11905     9.891
11906    32.341
11907    30.073
11908    32.667
11909    28.321
11910    16.778
11911    15.975
11912    15.387
11913    20.170
11914    18.566
11915    19.708
11916    29.277
11917    23.930
11918    16.654
11919    26.804
11920    32.108
11921    23.238
11922    21.898
11923    27.221
11924     9.517
11925    13.518
11926    23.342
11927    22.788
11928    11.694
11929    21.461
11930    16.836
11931    28.864
11932    15.256
Name: knowledge_score, Length: 11933, dtype: float64

In [58]:
df.iloc[2]

reading_score         40.680
math_score            28.570
knowledge_score       28.108
p2income           90000.000
incomecat              3.000
Name: 2, dtype: float64
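
The slicing patterns used so far, plus two common relatives (`.loc` and a boolean filter), can be sketched on a small frame. The frame `sample` below is a made-up stand-in built from rows 0 to 3 of the output shown earlier, not the full dataset.

```python
import pandas

# Hypothetical four-row frame built from rows 0-3 of the output shown earlier.
sample = pandas.DataFrame({
    'reading_score': [36.58, 50.82, 40.68, 32.57],
    'math_score': [39.54, 44.44, 28.57, 23.57],
    'incomecat': [3, 3, 3, 2],
})

third_row = sample.iloc[2]                      # one row by position, returned as a Series
first_two = sample.iloc[0:2]                    # rows 0 and 1 (the end position is excluded)
one_cell = sample.loc[2, 'math_score']          # one value by row label and column name
high_income = sample[sample['incomecat'] == 3]  # only rows where the condition is True

print(one_cell)          # 28.57
print(len(first_two))    # 2
print(len(high_income))  # 3
```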

## Summary Statistics

In [59]:
df['reading_score'].mean()

35.95421520154177

In [60]:
df['reading_score'].sum()

429041.6499999979

In [63]:
df['math_score'].std()

9.120505071228266

In [64]:
df.describe()

| | reading_score | math_score | knowledge_score | p2income | incomecat |
|---|---|---|---|---|---|
| count | 11933.0 | 11933.0 | 11933.0 | 11933.0 | 11933.0 |
| mean | 35.954215 | 27.128244 | 23.073694 | 54317.19993 | 1.895165 |
| std | 10.47313 | 9.120505 | 7.396978 | 36639.061147 | 0.822692 |
| min | 21.01 | 10.51 | 6.985 | 1.0 | 1.0 |
| 25% | 29.34 | 20.68 | 17.385 | 27000.0 | 1.0 |
| 50% | 34.06 | 25.68 | 22.954 | 47000.0 | 2.0 |
| 75% | 39.89 | 31.59 | 28.305 | 72000.0 | 3.0 |
| max | 138.51 | 115.65 | 47.691 | 150000.0 | 3.0 |


In [None]:
#EX: find the mean knowledge_score


## Differences between means 
What if we want to know if the mean is different across categories?
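
One way to approach this is `groupby`, which splits the rows by a category and computes a statistic within each group. The sketch below uses only the ten rows displayed earlier, stored in a made-up frame called `sample`. With the full dataset you would call the same methods on `df`, and the numbers would differ.

```python
import pandas

# Hypothetical ten-row frame copied from the first rows shown earlier.
sample = pandas.DataFrame({
    'reading_score': [36.58, 50.82, 40.68, 32.57, 31.98,
                      50.45, 32.49, 33.30, 65.92, 34.20],
    'knowledge_score': [33.822, 38.147, 28.108, 15.404, 18.727,
                        33.352, 26.211, 27.072, 33.514, 28.096],
    'incomecat': [3, 3, 3, 2, 2, 3, 2, 3, 3, 3],
})

# Mean of each score within each income category.
group_means = sample.groupby('incomecat')[['reading_score', 'knowledge_score']].mean()

# Number of rows in each category, useful context for any comparison of means.
group_sizes = sample.groupby('incomecat').size()

print(group_means)
print(group_sizes)
```

Note that these ten rows happen to contain no low-income (category 1) students, so only two groups appear here.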

In [None]:
#Ex: explore other categories based on income group

# 3. Visualization

We'll use another library for this: matplotlib
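
A minimal sketch of the three plot types named in the learning outcomes (histogram, scatter plot, bar chart), drawn side by side. It uses a made-up frame `sample` holding only the ten rows displayed earlier. The choice of columns and of five histogram bins is arbitrary. In Jupyter the figure is rendered at the end of the cell; in a plain script you would add `plt.show()`.

```python
import matplotlib.pyplot as plt
import pandas

# Hypothetical ten-row frame copied from the first rows shown earlier.
sample = pandas.DataFrame({
    'reading_score': [36.58, 50.82, 40.68, 32.57, 31.98,
                      50.45, 32.49, 33.30, 65.92, 34.20],
    'math_score': [39.54, 44.44, 28.57, 23.57, 19.65,
                   36.27, 20.82, 26.85, 47.36, 22.27],
    'knowledge_score': [33.822, 38.147, 28.108, 15.404, 18.727,
                        33.352, 26.211, 27.072, 33.514, 28.096],
    'incomecat': [3, 3, 3, 2, 2, 3, 2, 3, 3, 3],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: the distribution of a single variable.
axes[0].hist(sample['reading_score'], bins=5)
axes[0].set_xlabel('reading_score')
axes[0].set_ylabel('count')

# Scatter plot: the relationship between two variables.
axes[1].scatter(sample['math_score'], sample['knowledge_score'])
axes[1].set_xlabel('math_score')
axes[1].set_ylabel('knowledge_score')

# Bar chart: one summary value per category.
group_means = sample.groupby('incomecat')['knowledge_score'].mean()
axes[2].bar(group_means.index.astype(str), group_means.values)
axes[2].set_xlabel('incomecat')
axes[2].set_ylabel('mean knowledge_score')

fig.tight_layout()
```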

In [None]:
#Ex: explore your column of choice here

In [None]:
#Ex: based on the visuals alone, is there a stronger relationship between math score and general knowledge,
#or between reading score and general knowledge?

# 4. Statistics using StatsModels

## T-test
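
As a hedged sketch, a two-sample t-test can be run with `ttest_ind` from `statsmodels.stats.weightstats`, which returns the test statistic, the p-value, and the degrees of freedom. The example compares reading scores between two income groups using only the ten rows displayed earlier (held in a made-up frame `sample`), so the result is illustrative and says nothing about the full dataset.

```python
import pandas
from statsmodels.stats.weightstats import ttest_ind

# Hypothetical ten-row frame copied from the first rows shown earlier.
sample = pandas.DataFrame({
    'reading_score': [36.58, 50.82, 40.68, 32.57, 31.98,
                      50.45, 32.49, 33.30, 65.92, 34.20],
    'incomecat': [3, 3, 3, 2, 2, 3, 2, 3, 3, 3],
})

# Split the reading scores into the two groups being compared.
high = sample.loc[sample['incomecat'] == 3, 'reading_score']
mid = sample.loc[sample['incomecat'] == 2, 'reading_score']

# usevar='pooled' assumes both groups share the same variance.
t_stat, p_value, dof = ttest_ind(high, mid, usevar='pooled')

print(t_stat, p_value, dof)
```

If the equal-variance assumption seems doubtful, `usevar='unequal'` requests the Welch version of the test instead.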

In [None]:
##Ex: produce a t-test for a variable of your choice

## OLS regression

In [2]:
##Is there a relationship between reading score, math score, and general knowledge?
## Which of these relationships is stronger?
# Is there still a relationship after controlling for income group?
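
One possible way to set up these models is the formula interface in `statsmodels.formula.api`, where the outcome goes to the left of `~` and the predictors to the right. The sketch below fits two models on a made-up frame `sample` holding only the ten rows displayed earlier. The choice of predictors is an assumption for illustration, and with so few rows the estimates are not meaningful.

```python
import pandas
import statsmodels.formula.api as smf

# Hypothetical ten-row frame copied from the first rows shown earlier.
sample = pandas.DataFrame({
    'reading_score': [36.58, 50.82, 40.68, 32.57, 31.98,
                      50.45, 32.49, 33.30, 65.92, 34.20],
    'math_score': [39.54, 44.44, 28.57, 23.57, 19.65,
                   36.27, 20.82, 26.85, 47.36, 22.27],
    'knowledge_score': [33.822, 38.147, 28.108, 15.404, 18.727,
                        33.352, 26.211, 27.072, 33.514, 28.096],
    'incomecat': [3, 3, 3, 2, 2, 3, 2, 3, 3, 3],
})

# Model 1: general knowledge predicted by reading and math scores.
results = smf.ols('knowledge_score ~ reading_score + math_score', data=sample).fit()
print(results.params)     # intercept and one coefficient per predictor
print(results.rsquared)   # share of variance explained

# Model 2: reading score, controlling for income group.
# C(...) tells the formula to treat incomecat as a category, not a number.
controlled = smf.ols('knowledge_score ~ reading_score + C(incomecat)', data=sample).fit()
print(controlled.params)
```

On the full dataset, `results.summary()` prints the complete regression table, including standard errors and p-values.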

In [3]:
##Ex: Explore the relationship between reading score and knowledge score.
## Do the same for other dependent/independent variables.