# Introduction to Python for Statistics

## Code and Text written by: <br/>
## Laura Nelson ~~ *Assistant Professor of Sociology*<br/><br/>Northeastern University
### Prepared for Professor Kilby <br/> January, 2019
<br/>

# Introduction

As social scientists, we now have access to unprecedented amounts of data. This provides great opportunity, but also new challenges. In particular, it is increasingly important to learn a scripting language, such as Python or R, in order to access, collect, and structure data from diverse sources, and to analyze that data using new and developing methods, such as machine learning.

In this lecture and hands-on workshop I will provide a brief introduction to the scripting language Python, using the Jupyter platform (what you're looking at now), with an eye toward data representation and analysis.

I am going to throw a lot of syntax at you, and there is no way you will remember all of it. Keep your notes, and mine, as a reference. Focus on learning the syntax, but also on getting a higher-level understanding of the way Python works. If you've never written in Python, all of this may feel very strange at first. The syntax will come as you work with it more.

# Learning Goals

- Understand what Python is, why it is useful, and how to use Python for data analysis.
- Understand how Python interacts with, and represents, data. 

# Learning Outcomes

- Learn and be able to explain Python basics - variables, variable types, manipulating variables, and dataframes
- Explain what the Python libraries Pandas and Statsmodels are and what they do
- Write enough code to:
    - Read in a dataset and manipulate a few variables
    - Produce basic summary statistics from a Pandas dataframe
    - Produce three visualizations from the dataframe: histogram, scatter plot, and bar chart
    - Implement a simple OLS regression model and interpret the output

    
# Workshop Outline
1. Python basics
2. The Pandas Dataframe
    - data representation
    - data manipulation
    - summary statistics
3. Data visualization using matplotlib
4. StatsModels for statistics

# 1. Python basics

<a id='arithmetic'></a>
### Arithmetic

## Variable assignment

Assigning variables is something that we do all the time in programming. These aren't quite like the variables from high school algebra, where *x* represents an unknown to solve for. Instead these are like notes to ourselves that we want to save some value(s) for later use.

Note that the equals sign is directional, like an arrow, telling the computer to give a certain value to a certain label.
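A minimal sketch of assignment follows. The variable names and values are made up for illustration.

```python
# The label goes on the left of the equals sign, the value on the right.
apples = 3
oranges = 5

# Reuse the saved values later by referring to their labels.
total_fruit = apples + oranges
print(total_fruit)  # 8

# Assignment is directional: the right side is evaluated first,
# then the result is stored under the label on the left.
apples = apples + 1
print(apples)  # 4
```

The last statement would make no sense as an algebraic equation, but it is ordinary Python: take the current value of `apples`, add one, and save the result under the same label.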

## Variable types

The type of a value is important. It determines which operators you can use and how those operators behave.
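As a small illustration (the names `number` and `text` are invented for this sketch), the same operator does different things to an integer and to a string:

```python
number = 4    # an integer
text = "4"    # a string that happens to contain a digit

print(type(number))  # <class 'int'>
print(type(text))    # <class 'str'>

# The same operator behaves differently depending on the type.
print(number + number)  # 8   (addition)
print(text + text)      # 44  (concatenation)
print(text * 3)         # 444 (repetition)

# number + text would raise a TypeError; convert one side first.
print(number + int(text))  # 8
```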

## More on operators and variable types

In Python, human language text gets represented as a *string*. A string is a sequence of characters set off by quotation marks, either double (") or single (').

We will explore different kinds of operations in Python that are specific to human language objects, but it is useful to start by trying to see them as the computer does, as numerical representations.
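One way to see the numerical representation is with the built-in functions `ord()` and `chr()`. This is a short sketch; the example word is arbitrary.

```python
word = "Data"

# Strings are sequences: they have a length and can be indexed.
print(len(word))  # 4
print(word[0])    # D

# Each character is stored as a number (its Unicode code point).
code_points = [ord(character) for character in word]
print(code_points)  # [68, 97, 116, 97]

# chr() goes the other way, from a number back to a character.
print(chr(68))  # D
```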

In [1]:
##Ex: I have 4 dozen eggs. 
#Write a script that assigns that number to a variable, 
#and prints the total number of eggs I have.

#Ex: Create a variable for hours worked, and another for pay per hour
# Use Python to calculate the total weekly earnings based on these two variables

# 2. The Pandas Dataframe

******************************
The data we'll analyze today comes from:

National Center for Education Statistics, United States Department of Education. (2009). Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K) [Data file]. Available from http://nces.ed.gov/ecls/kindergarten.asp

I selected five variables (columns) to analyze:

* reading_score = READING IRT SCALE SCORE
* math_score = MATH IRT SCALE SCORE
* knowledge_score = GENERAL KNOWLEDGE IRT SCALE SCORE
* p2income = TOTAL HOUSEHOLD INCOME
* incomecat = INCOME CATEGORIES
    * 1 = low income: < \$40,000
    * 2 = mid income: >= \$40,000 and < \$70,000
    * 3 = high income: >= \$70,000
    
The unit of observation (row) is the individual kindergartner.  
   
## Motivating Questions

1. Are math, reading, and general knowledge scores related to household income in any predictable way?
2. Can you predict general knowledge scores from reading or math scores? That is, are reading and math skills related to general knowledge?

## Dataframe slicing
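The sketch below shows a few common ways to slice a Pandas dataframe. It uses a tiny dataframe with invented values rather than the ECLS-K file, so the numbers are placeholders; only the column names are borrowed from the variable list above.

```python
import pandas as pd

# Toy values invented for illustration; not the ECLS-K data.
df = pd.DataFrame({
    "reading_score": [30.1, 45.2, 52.3, 38.4],
    "math_score": [25.0, 40.5, 48.0, 33.2],
    "incomecat": [1, 2, 3, 1],
})

# Select one column (a Series) or several columns (a DataFrame).
reading = df["reading_score"]
scores = df[["reading_score", "math_score"]]

# Select rows by position with .iloc (here, the first two rows).
first_two = df.iloc[0:2]

# Select rows with a boolean condition (here, the low income group).
low_income = df[df["incomecat"] == 1]
print(low_income)
```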

## Summary Statistics
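A minimal sketch of summary statistics in Pandas, again on a toy dataframe with invented values (not the ECLS-K data):

```python
import pandas as pd

# Toy values invented for illustration; not the ECLS-K data.
df = pd.DataFrame({
    "reading_score": [30.0, 40.0, 50.0, 60.0],
    "incomecat": [1, 2, 3, 1],
})

# describe() reports count, mean, std, min, quartiles, and max.
summary = df["reading_score"].describe()
print(summary)

# Individual statistics are also available as methods.
mean_reading = df["reading_score"].mean()
print(mean_reading)  # 45.0

# For a categorical column, value_counts() gives the frequency of each value.
counts = df["incomecat"].value_counts()
print(counts)
```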

## Differences between means 
What if we want to know whether the mean differs across categories?
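One way to compare means across categories is `groupby()`. The sketch below uses invented values, so the resulting means say nothing about the real data.

```python
import pandas as pd

# Toy values invented for illustration; not the ECLS-K data.
df = pd.DataFrame({
    "math_score": [20.0, 30.0, 40.0, 50.0, 60.0, 70.0],
    "incomecat": [1, 1, 2, 2, 3, 3],
})

# Split the rows by income category, then take the mean within each group.
group_means = df.groupby("incomecat")["math_score"].mean()
print(group_means)
```

The result is a Series indexed by `incomecat`, with one mean per income group.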

In [None]:
#Ex: explore other categories based on income group

# 3. Visualization

We'll use another library for this: matplotlib.
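Here is a minimal sketch of a histogram and a scatter plot drawn with matplotlib from a Pandas dataframe. The values are invented, and the figure size and number of bins are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy values invented for illustration; not the ECLS-K data.
df = pd.DataFrame({
    "reading_score": [30.0, 35.0, 41.0, 44.0, 52.0, 58.0],
    "math_score": [25.0, 31.0, 36.0, 42.0, 47.0, 55.0],
})

# One figure with two side-by-side panels.
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: the distribution of a single column.
ax_hist.hist(df["reading_score"], bins=5)
ax_hist.set_xlabel("reading_score")
ax_hist.set_ylabel("count")

# Scatter plot: the relationship between two columns.
ax_scatter.scatter(df["reading_score"], df["math_score"])
ax_scatter.set_xlabel("reading_score")
ax_scatter.set_ylabel("math_score")

fig.tight_layout()
# In Jupyter the figure renders when the cell finishes;
# in a plain script, call plt.show() to display it.
```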

In [None]:
#Ex: explore your column of choice here

In [None]:
#Ex: based on visuals alone, is there a stronger relationship between math score and general knowledge,
#or between reading score and general knowledge?

# 4. Statistics using StatsModels

## T-test
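A two-sample t-test can be sketched with `ttest_ind` from `statsmodels.stats.weightstats`. The values below are invented, and the comparison of the low and high income groups is just one possible choice.

```python
import pandas as pd
from statsmodels.stats.weightstats import ttest_ind

# Toy values invented for illustration; not the ECLS-K data.
df = pd.DataFrame({
    "reading_score": [20.0, 22.0, 24.0, 26.0, 30.0, 32.0, 34.0, 36.0],
    "incomecat": [1, 1, 1, 1, 3, 3, 3, 3],
})

# Pull out the scores for the two groups we want to compare.
low = df.loc[df["incomecat"] == 1, "reading_score"]
high = df.loc[df["incomecat"] == 3, "reading_score"]

# ttest_ind returns the t statistic, the p-value, and the degrees of freedom.
# By default it assumes equal variances in the two groups (usevar="pooled").
t_stat, p_value, dof = ttest_ind(low, high)
print(t_stat, p_value, dof)
```

A small p-value suggests that the difference between the two group means is unlikely to be due to chance alone.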

In [None]:
##Ex: produce a t-test for a variable of your choice

## OLS regression

In [2]:
##Is there a relationship between reading score, math score, and general knowledge?
## Which of these relationships is stronger?
# Is there a relationship controlling for income group?
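As a rough sketch of how such a model could be set up, the block below fits an OLS regression with the StatsModels formula interface. The data are invented, so the coefficients mean nothing substantively. The formula, with `C(incomecat)` treating income group as categorical, is one reasonable specification rather than the only one.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy values invented for illustration; not the ECLS-K data.
df = pd.DataFrame({
    "math_score": [20, 25, 30, 35, 40, 45, 50, 55, 60],
    "knowledge_score": [12, 15, 16, 19, 22, 23, 27, 28, 32],
    "incomecat": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})

# Dependent variable on the left of ~, independent variables on the right.
# C(...) tells StatsModels to treat incomecat as a set of categories.
model = smf.ols("knowledge_score ~ math_score + C(incomecat)", data=df).fit()

# Estimated coefficients and the share of variance explained.
print(model.params)
print(model.rsquared)

# model.summary() prints the full regression table.
```

The coefficient on `math_score` is the expected change in `knowledge_score` for a one-point increase in `math_score`, holding income group constant.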

In [3]:
##Ex: Explore the relationship between reading score and knowledge score.
## Do the same for other dependent/independent variables.