# Introduction to Python 

## Code and Text written by: <br/>
## Laura Nelson ~~ *Assistant Professor of Sociology*<br/><br/>Northeastern University

<br/>

# Introduction

It is increasingly important to learn a scripting language, such as Python or R, in order to access, collect, and structure data from diverse sources, and analyze data using new and developing methods, such as machine learning.

You will not remember all of this, and that is completely normal. Focus on learning the syntax while also building a higher-level understanding of how Python works. This is only a basic introduction to how Python works and what you can do with it; many resources are available if you want to go further. If you have never written Python, much of this may feel strange at first. It gets easier the more you work with it.

# Learning Goals

- Understand what Python is, why it is useful, and how to use Python for data analysis.
- Understand how Python interacts with, and represents, data. 

# Learning Outcomes

- Learn and be able to explain Python basics: introduction, arithmetic, dataframes, visualizations
    
# Workshop Outline
1. Python basics
2. Dataframes
3. Visualizations


# 1. Python basics

In [None]:
# Python is all about functions, variables, and doing things to variables using functions
# the print function

print("Hello, world!")
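As a small illustration of "doing things to variables using functions", the sketch below stores a string in a variable and passes it to a few built-in functions and methods. The variable names (`greeting`, `length`, `shouted`) are made up for this example.

```python
# Store a value in a variable (the name `greeting` is just an example)
greeting = "Hello, world!"

# len() is a built-in function that counts the characters in a string
length = len(greeting)

# .upper() is a string method that returns an uppercase copy
shouted = greeting.upper()

print(greeting)
print(length)
print(shouted)
```

The original string is not changed by `.upper()`; the result is stored in a new variable.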

In [None]:
##try printing something yourself here!


<a id='arithmetic'></a>
### Arithmetic

In [None]:
# Computers are really good at arithmetic
# Addition

2+5

In [None]:
# Let's have Python report the results from three operations at the same time

print(2-5)
print(2*5)
print(2/5)
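Beyond the four basic operations, a few more arithmetic operators come up often. The following is a short sketch, with variable names chosen only for illustration, that also shows how parentheses change the order of operations.

```python
# Division with / always returns a decimal (float) in Python 3
quotient = 2 / 5            # 0.4

# // divides and rounds down to a whole number; % gives the remainder
floor_quotient = 7 // 2     # 3
remainder = 7 % 2           # 1

# ** raises a number to a power
power = 2 ** 5              # 32

# Multiplication happens before addition unless parentheses say otherwise
no_parens = 2 + 5 * 3       # 17
with_parens = (2 + 5) * 3   # 21

print(quotient, floor_quotient, remainder, power, no_parens, with_parens)
```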

In [None]:
## Take 5 minutes and try some arithmetic of your own here!


# 2. The Pandas Dataframe

******************************
The data we'll analyze today comes from:

National Center for Education Statistics, United States Department of Education. (2009). Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K) [Data file]. Available from http://nces.ed.gov/ecls/kindergarten.asp

I selected five variables (columns) to analyze:

* reading_score = READING IRT SCALE SCORE
* math_score = MATH IRT SCALE SCORE
* knowledge_score = GENERAL KNOWLEDGE IRT SCALE SCORE
* p2income = TOTAL HOUSEHOLD INCOME
* incomecat = INCOME CATEGORIES
    * 1 = low income: < \$40,000
    * 2 = mid income
    * 3 = high income: >= \$70,000
    
The unit of observation (row) is the individual kindergartner.  
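The dataset already contains `incomecat`, so nothing needs to be computed. Purely as an illustration of how categories like these could be derived from `p2income`, here is a minimal sketch using `pandas.cut`. It assumes that mid income runs from \$40,000 up to, but not including, \$70,000 (implied by the definitions above, not stated), and the income values in `toy` are invented for the example.

```python
import pandas

# Invented household incomes, for illustration only (not from ECLS-K)
toy = pandas.DataFrame({"p2income": [25000.0, 40000.0, 69999.0, 70000.0, 120000.0]})

# Bin edges: below 40,000 -> 1, 40,000 up to 70,000 -> 2, 70,000 and above -> 3
bins = [float("-inf"), 40000, 70000, float("inf")]

# right=False makes each bin include its left edge and exclude its right edge
toy["incomecat"] = pandas.cut(toy["p2income"], bins=bins, labels=[1, 2, 3], right=False)

print(toy)
```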
   
## Motivating Questions

1. Are math, reading, and general knowledge scores related to household income in any predictable way?
2. Can you predict general knowledge scores from reading or math scores? That is, are reading and math skills related to general knowledge?

In [None]:
#import our library 
import pandas

In [None]:
df = pandas.read_csv("education_dataset.csv", sep=',')
df

In [None]:
#It's easier to view the information when you have a piece of it instead of the entire thing, 
#especially when you have a LOT of data
df.head()

In [None]:
#syntax to extract columns
df['reading_score'].head()

In [None]:
## Try it yourself: Extract the knowledge score column



In [None]:
#extract one row: notice the syntax
df.loc[0]
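`.loc` selects rows by their index label, and it can also take a column name. The sketch below uses a tiny invented dataframe (`toy`, not the workshop data) to show `.loc` next to its position-based counterpart `.iloc`.

```python
import pandas

# A tiny invented dataframe, for illustration only
toy = pandas.DataFrame(
    {"reading_score": [30.5, 42.1, 55.0], "math_score": [28.0, 39.5, 50.2]}
)

# .loc looks rows up by index label; the default labels are 0, 1, 2, ...
first_row = toy.loc[0]

# .iloc looks rows up by position; with default labels this gives the same row
same_row = toy.iloc[0]

# .loc also accepts a column name to pull out a single value
one_value = toy.loc[1, "math_score"]

# Slicing with .loc includes the end label, so 0:1 returns two rows
subset = toy.loc[0:1, ["reading_score"]]

print(first_row)
print(one_value)
print(subset)
```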

In [None]:
## Try it yourself - extract the 20th row of data 



## Summary Statistics

In [None]:
## Summary statistics
df['reading_score'].mean()

In [None]:
df['reading_score'].sum()

In [None]:
df['reading_score'].std()
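To see what these methods compute, the sketch below reproduces the mean and standard deviation by hand on a small invented series. By default, pandas `.std()` returns the sample standard deviation, which divides by n - 1 rather than n.

```python
import pandas

# Invented scores, for illustration only
scores = pandas.Series([10.0, 20.0, 30.0, 40.0])

# Built-in summary methods
mean = scores.mean()
std = scores.std()

# The same quantities computed step by step
manual_mean = scores.sum() / scores.count()
squared_deviations = (scores - manual_mean) ** 2
manual_std = (squared_deviations.sum() / (scores.count() - 1)) ** 0.5

print(mean, manual_mean)
print(std, manual_std)
```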

In [None]:
##Ex: explore one of the other columns - 
#find the mean, sum and standard deviation of another column that interests you!




In [None]:
#We can find it all at the same time

df.describe()

# 3. Visualization

In [None]:
import matplotlib.pyplot as plt

In [None]:
df.hist()
plt.show()

In [None]:
#That's not pretty. Let's show just one

df['knowledge_score'].hist()
plt.show()

In [None]:
## Try doing a histogram with another column that interests you




In [None]:
#Other options:
#Scatter plot: are math and reading scores correlated?

df.plot(kind='scatter', x = 'reading_score', y = 'math_score')
plt.show()
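A scatter plot shows a relationship visually; a correlation coefficient puts a number on it. As a rough sketch, the pandas `Series.corr` method computes the Pearson correlation between two columns. The values in `toy` below are invented, so the result says nothing about the real data.

```python
import pandas

# Invented scores, for illustration only
toy = pandas.DataFrame(
    {
        "reading_score": [20.0, 30.0, 40.0, 50.0],
        "math_score": [18.0, 33.0, 39.0, 52.0],
    }
)

# Pearson correlation: near 1 means the two columns rise together,
# near -1 means one falls as the other rises, near 0 means no linear relationship
r = toy["reading_score"].corr(toy["math_score"])

print(round(r, 3))
```

On the workshop dataframe, the same call would be `df['reading_score'].corr(df['math_score'])`.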

In [None]:
## Finish up by looking at scatterplots of different relationships on your own
## What are some patterns you find? Ex: Is there a relationship between scores and household income?



In [None]:
#advanced: Pandas groupby function
#group the rows by income category, then take the mean of every column within each group

df_grouped = df.groupby('incomecat')
df_grouped_mean = df_grouped.mean()
df_grouped_mean
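To make the split-then-summarize idea behind `groupby` concrete, here is a minimal sketch on an invented six-row dataframe. The numbers are chosen so the group means are easy to check by hand.

```python
import pandas

# Invented data: two children in each income category
toy = pandas.DataFrame(
    {
        "incomecat": [1, 1, 2, 2, 3, 3],
        "reading_score": [20.0, 30.0, 30.0, 40.0, 40.0, 50.0],
    }
)

# Split the rows by incomecat, then average reading_score within each group
means = toy.groupby("incomecat")["reading_score"].mean()

print(means)
```

The result is indexed by income category, so `means.loc[2]` returns the average for the mid-income group.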

In [None]:
df_grouped_mean[['reading_score', 'math_score', 'knowledge_score']].plot(kind='bar')
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol = 3)
plt.show()

## Want to learn more?

If you are interested in Computational Social Science, data analytics, ethical implications, or any of the topics we covered today, we encourage you to look into courses or minors you might pursue!

- Computational Social Science minor
- Digital Minor
- Combined major in Computer Science and CSSH
- Other courses you might take: DS 2000/DS 2001 (Data Science)


### Follow the NULab for workshops, events, potential courses, and more!

- https://web.northeastern.edu/nulab/ 
- @NUlabTMN on Twitter

### Our Contact Information
- Laura Nelson (l.nelson@northeastern.edu), Department of Sociology and Anthropology
- Cara Messina (messina.c@husky.neu.edu), NULab Coordinator; office hours: 409 Nightingale Hall, Tuesdays 12-1
- Alexis Yohros (yohros.a@husky.neu.edu), Digital Teaching Integration RA