# <center>Using Python: A Brief Introduction</center>

![](https://blogs.gartner.com/doug-laney/files/2015/01/big-data-word-cloud.jpg)


## <center><br/>Cara Marta Messina, Assistant Director, Digital Integration Teaching Initiative<br/>Northeastern University</center>

<center>Prepared for the week of March 16th, 2020<br/>Online iteration. For instructions visit BIT.LY LINK</center>

# Introduction to Python 

## Code and Text written by:
## Laura Nelson -- *Assistant Professor of Sociology*<br/><br/>Northeastern University

<br/>

# Introduction

It is increasingly important to learn a scripting language, such as Python or R, in order to access, collect, and structure data from diverse sources, and analyze data using new and developing methods, such as machine learning.

You will not remember all of this, and that is completely normal. Focus on learning the syntax while also building a higher-level understanding of how Python works. This is only a basic introduction to how Python works and what you can do with it; there are many resources available if this interests you. If you've never written Python, all of this may feel strange at first. It gets easier the more you work with it.

# Learning Goals

- Understand what Python is, why it is useful, and how to use Python for data analysis.
- Understand how Python interacts with, and represents, data. 

# Learning Outcomes

- Learn and be able to explain Python basics - introduction, arithmetic, dataframes, visualizations
    
# Workshop Outline
1. Python basics
2. Dataframes
3. Visualizations


# 1. Python basics
For the basics, you will be learning how to run <strong>cells</strong>, which are the gray boxes below. Each cell has an <strong>input</strong>, which is your code, and an <strong>output</strong>, which is the result of your code. Python is all about functions, variables, and doing things to variables using functions.

The first function you will learn is the "print" function. The "print" function prints whatever is inside the parentheses. If you want to print text, put it in quotation marks: "Hello, world!" Anything in quotes is called a <strong>string</strong>. Strings are mostly textual data.

If you want to print the result of arithmetic or of another function, you do not need quotation marks.
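
To make the difference concrete, here is a minimal sketch. The variable names `quoted` and `unquoted` are just illustrative choices: the same characters print as text when they are in quotation marks and as a computed result when they are not.

```python
# With quotation marks, Python treats the characters as a string and prints them unchanged
quoted = "2+2"
print(quoted)      # prints: 2+2

# Without quotation marks, Python evaluates the arithmetic first and prints the result
unquoted = 2 + 2
print(unquoted)    # prints: 4
```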

To <strong>run a cell,</strong> click the cell you want to run and then click the Run button in the toolbar at the top.

In [None]:
# To write a comment in a cell, start the line with the hash symbol (#)
# Our first function is the print() function
# Click this cell and then the Run button to print "Hello, world!"

print("Hello, world!")

In [None]:
#now try printing arithmetic instead of a string.
print(2+2)

In [None]:
##try printing a string (inside quotation marks) yourself here and run the cell.
print()


<a id='arithmetic'></a>
### Arithmetic
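
As a quick reference for the cells in this section, the sketch below lists Python's basic arithmetic operators. The numbers 2 and 5 and the variable names are arbitrary examples. Note that `/` always returns a decimal (float) result in Python 3.

```python
# Basic arithmetic operators, using 2 and 5 as example values
total = 2 + 5        # addition: 7
difference = 2 - 5   # subtraction: -3
product = 2 * 5      # multiplication: 10
quotient = 2 / 5     # division: 0.4 (always a float in Python 3)
power = 2 ** 5       # exponent (2 to the 5th power): 32
remainder = 5 % 2    # remainder after dividing 5 by 2: 1

print(total, difference, product, quotient, power, remainder)
# prints: 7 -3 10 0.4 32 1
```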

In [None]:
# Computers are really good at arithmetic
# Addition

2+5

In [None]:
# Let's have Python report the results from three operations at the same time
# Use the print function to see ALL the results.

print(2-5)
print(2*5)
print(2/5)

In [None]:
## Take 2 minutes and try some arithmetic of your own here.
# Use the print function to see ALL the results!



# 2. The Pandas Dataframe

******************************
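In Python, Pandas is a <strong>library</strong> (a collection of ready-made code) for working with tabular data. It can read CSV files into <strong>dataframes</strong>, which are tables with rows and named columns.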
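
To illustrate what reading a CSV into a dataframe means, here is a minimal sketch that reads a tiny CSV from a text string rather than from a file. The column names mirror the workshop dataset, but the three rows are made-up values for illustration only, and `toy_df` is just a name chosen here.

```python
import io
import pandas

# A tiny, made-up CSV: the first line holds column names, each later line is one row
csv_text = """reading_score,math_score,incomecat
30.5,25.0,1
42.0,38.5,2
55.5,47.0,3
"""

# read_csv accepts a file path or a file-like object; StringIO wraps the text as a file
toy_df = pandas.read_csv(io.StringIO(csv_text))

print(toy_df.shape)           # (3, 3): three rows, three columns
print(list(toy_df.columns))   # the column names taken from the first line
```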

## Our Data
The data we'll analyze today is in the Azure Notebook folder, in a file called "education_dataset.csv". You do <em>not</em> need to open this file. Instead, we are going to use Pandas to read the file into our Jupyter Notebook. This dataset comes from:

National Center for Education Statistics, United States Department of Education. (2009). Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K) [Data file]. Available from http://nces.ed.gov/ecls/kindergarten.asp

I selected five variables (columns) to analyze:

* reading_score = READING IRT SCALE SCORE
* math_score = MATH IRT SCALE SCORE
* knowledge_score = GENERAL KNOWLEDGE IRT SCALE SCORE
* p2income = TOTAL HOUSEHOLD INCOME
* incomecat = INCOME CATEGORIES
    * 1 = low income: < \$40,000
    * 2 = mid income
    * 3 = high income: >= \$70,000
    
The unit of observation (row) is the individual kindergartner.  
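
The `incomecat` coding can be sketched with `pandas.cut`, which sorts numbers into bins. This is an illustration, not a description of how the original dataset was built: the incomes below are made up, and treating category 2 as \$40,000 up to (but not including) \$70,000 is an assumption inferred from the other two cut points.

```python
import pandas

# Made-up household incomes, for illustration only
incomes = pandas.Series([25000, 40000, 69999, 70000, 120000])

# right=False makes each bin include its left edge and exclude its right edge,
# so 40,000 falls in category 2 and 70,000 falls in category 3
# (the category 2 range is an assumption, see the note above)
categories = pandas.cut(
    incomes,
    bins=[float("-inf"), 40000, 70000, float("inf")],
    labels=[1, 2, 3],
    right=False,
)

print(categories.tolist())   # [1, 2, 2, 3, 3]
```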
   
## Motivating Questions

1. Are math, reading, and general knowledge scores related to household income in any predictable way?
2. Can you predict general knowledge scores from reading or math scores? That is, are reading and math skills related to general knowledge?
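
One common way to put a number on "related" is the correlation coefficient, which runs from -1 to 1. The sketch below uses the Pandas `.corr()` method on made-up scores (not the workshop data), so the value it prints says nothing about the real dataset.

```python
import pandas

# Made-up scores for five hypothetical students, for illustration only
toy_scores = pandas.DataFrame({
    "reading_score": [20, 30, 40, 50, 60],
    "math_score": [15, 22, 35, 41, 58],
})

# Pearson correlation between the two columns: values near 1 mean the scores
# tend to rise together, values near 0 mean little linear relationship
r = toy_scores["reading_score"].corr(toy_scores["math_score"])
print(round(r, 2))
```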

In [None]:
# Import our Pandas library 

import pandas

In [None]:
# Create a variable, "df", that holds the dataframe Pandas builds when it reads our dataset.
df = pandas.read_csv("education_dataset.csv", sep=',')

# By just typing the variable name and running the cell, we will be able to see what the variable looks like
df

In [None]:
# It's easier to view a piece of the data instead of the entire thing,
# especially when you have a LOT of data
# head(10) shows the first 10 rows of our data (head() with no number shows the first 5)

df.head(10)

In [None]:
# How to extract a specific column

df['reading_score'].head(20)

In [None]:
# Try it yourself: Extract the first 20 rows of knowledge score column


In [None]:
# To extract one row: notice the syntax and the ZERO (Python starts at 0)
df.loc[0]
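
A note on the syntax above: `.loc` looks rows up by their index label, while `.iloc` looks them up by position. With the default index (0, 1, 2, ...) the two agree. The sketch below uses a made-up dataframe, not the workshop data, to show both.

```python
import pandas

# Made-up data, for illustration only; the default index labels are 0, 1, 2
toy_df = pandas.DataFrame({"reading_score": [30.5, 42.0, 55.5]})

first_by_label = toy_df.loc[0]        # row whose index label is 0
first_by_position = toy_df.iloc[0]    # row in position 0 (the first row)
third_by_position = toy_df.iloc[2]    # position 2 is the third row

print(first_by_label["reading_score"])      # 30.5
print(third_by_position["reading_score"])   # 55.5
```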

In [None]:
## Try it yourself - extract the 20th row of data. Remember, Python starts at 0


## Summary Statistics

Now that you have learned how to read in your dataset and look at some of the information about your dataset, including specific columns and certain row numbers, let's learn some basic statistics about our dataset!

As a reminder, <strong>df</strong> is the variable name of our dataset. Anytime you type "df", you are letting Python know you are using that dataset.

Here is a list of the functions we will look at:
- .mean() finds the mean, or average, of all the numbers in a particular column or each column from the dataset
- .sum() finds the sum of all the numbers in a particular column or each column from the dataset
- .std() finds the standard deviation of all the numbers in a particular column or each column from the dataset
- .describe() finds a collection of statistical information from a particular column or the entire dataset
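
To see what these functions return, here is a minimal sketch on a made-up column of four numbers (not the workshop data), small enough to check by hand.

```python
import pandas

# Made-up scores, for illustration only
toy_df = pandas.DataFrame({"reading_score": [10.0, 20.0, 30.0, 40.0]})
scores = toy_df["reading_score"]

print(scores.mean())       # 25.0 (100 divided by 4)
print(scores.sum())        # 100.0
print(scores.std())        # sample standard deviation (divides by n - 1)
print(scores.describe())   # count, mean, std, min, quartiles, and max together
```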

In [None]:
# First, let's find the mean, sum, and standard deviation of every column
# I will use the print() function to show ALL the outputs

print("MEAN") #I am printing a line before each function so we know which output belongs with which function
print(df.mean())
print("SUM")
print(df.sum())
print("STANDARD DEVIATION")
print(df.std())

In [None]:
## Summary statistics: the average reading score

df['reading_score'].mean()

In [None]:
# Summary statistics: the sum of the reading scores

df['reading_score'].sum()

In [None]:
#Standard deviation: the standard deviation of the reading scores

df['reading_score'].std()

In [None]:
# Try it yourself - find the MEAN, SUM, and STANDARD DEVIATION of another column that interests you!
# Hint: find the column names above
# Hint 2: if your output is only showing one result, use the 'print' function (put your function inside the print parenthesis)



In [None]:
# We can find it all at the same time. Simply run this cell!

df.describe()

# 3. Visualization

Python can also make quick and easy visualizations. We will be using the Matplotlib library, which makes quick plots. Run the cells below.

Here is a list of the important functions:
- .hist() is a histogram of either an entire dataframe or a particular column in a dataframe
- .plot() is a basic plot, but we can add attributes in this function, such as the <strong>kind</strong> of plot and the <strong>x</strong> and <strong>y</strong> axes.
- .groupby() groups your dataframe by particular columns or information. This is a bit more of an advanced technique. 
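
As a rough preview of how grouping and plotting fit together, the sketch below groups a made-up dataframe by income category, averages each group, and draws a bar chart. The values are invented for illustration, and `group_means` and `ax` are just names chosen here.

```python
import pandas
import matplotlib.pyplot as plt

# Made-up data, for illustration only: two students in each income category
toy_df = pandas.DataFrame({
    "incomecat": [1, 1, 2, 2, 3, 3],
    "reading_score": [20.0, 30.0, 35.0, 45.0, 50.0, 60.0],
})

# groupby() collects rows that share an incomecat value; mean() averages each group
group_means = toy_df.groupby("incomecat")["reading_score"].mean()
print(group_means.tolist())   # [25.0, 40.0, 55.0]

# plot(kind="bar") draws one bar per group and returns the Matplotlib axes
ax = group_means.plot(kind="bar")
ax.set_ylabel("mean reading score")
plt.show()
```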

In [None]:
import matplotlib.pyplot as plt

In [None]:
# The hist() function shows a histogram of ALL the columns 
df.hist()

In [None]:
#That's not pretty. Let's show just ONE column

df['knowledge_score'].hist()

In [None]:
# Try doing a histogram with another column that interests you



In [None]:
# Other options: You can make a scatter plot! 
# We tell Python the "kind" of plot is a "scatter" and that the X and Y axes are particular columns 
# Our question: are math and reading scores correlated?

df.plot(kind='scatter', x = 'reading_score', y = 'math_score')


In [None]:
## Finish up by looking at scatterplots of different relationships on your own
## What are some patterns you find? Ex: Is there a relationship between scores and household income?



In [None]:
# Advanced: the Pandas groupby function. This is just a sample of a more focused analysis.
# Create a new dataframe that is grouped by income category

df_grouped = df.groupby('incomecat')
df_grouped 

In [None]:
df_grouped_mean = df_grouped.mean()
df_grouped_mean

In [None]:
df_grouped_mean[['reading_score', 'math_score', 'knowledge_score']].plot(kind='bar')
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol = 3)
plt.show()

## Want to learn more?

If you are interested in Computational Social Science, data analytics, ethical implications, or any of the other topics we covered today, we encourage you to look at courses or minors you might pursue!

- Computational Social Science minor
- Digital Minor
- Combined major in Computer Science and CSSH
- Other courses you might take: DS 2000/DS 2001 (Data Science)


# Thank You!

If you have questions, contact us at:

### Cara Marta Messina
Digital Integration Teaching Initiative<br/>
Assistant Director<br/>
messina.c@husky.neu.edu

### Slides, handouts, and data available at LINK
### Schedule an appointment with us! [https://bit.ly/diti-office-hours](https://bit.ly/diti-office-hours)