# Crash Course in Python Libraries for Data Analysis

## Intro

- Why am I choosing Python?

## Tools

- Anaconda
    - Python 2.7
    - Jupyter Notebook
- GitHub (optional)

## Libraries

- numpy
    - the name is short for "Numerical Python"
    - its advantage over "vanilla" Python is the ability to perform mathematical operations on entire datasets without writing loops
- pandas
    - main Python library for data analysis
    - created by Wes McKinney (author of "Python for Data Analysis") to "mimic" the functionality of R in Python
- matplotlib
    - graphing and data visualization module for Python

## numpy

In [1]:
import numpy as np

cvalues = [25.3, 24.8, 26.9, 23.9]

# create array
C = np.array(cvalues)
print(C)

[ 25.3  24.8  26.9  23.9]


A numpy array is a grid of elements that all share the same data type.

In [2]:
print(C.dtype)

float64
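
To make the "same data type" point concrete, here is a minimal sketch of what happens when the input values are mixed. The variable names are made up for illustration.

```python
import numpy as np

# Mixing ints and floats: numpy upcasts every element to a common type
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)      # float64

# A dtype can also be requested explicitly when the array is created
ints = np.array([1, 2, 3], dtype=np.int32)
print(ints.dtype)       # int32

# astype returns a converted copy; the original array is left unchanged
as_float = ints.astype(np.float64)
print(as_float.dtype)   # float64
```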


In [4]:
#perform math over the entire array at once (Celsius to Fahrenheit)
print(C * 9 / 5 + 32)

[ 77.54  76.64  80.42  75.02]
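
For comparison, the sketch below does the same conversion with a plain Python loop (here a list comprehension) and with numpy, then checks that the two agree. The names `f_loop` and `f_vec` are illustrative only.

```python
import numpy as np

cvalues = [25.3, 24.8, 26.9, 23.9]

# Plain Python: convert each Celsius value to Fahrenheit one at a time
f_loop = [c * 9 / 5 + 32 for c in cvalues]

# numpy: the same arithmetic applied to the whole array in one expression
f_vec = np.array(cvalues) * 9 / 5 + 32

# Both approaches agree (up to floating-point rounding)
print(np.allclose(f_loop, f_vec))
```

On four values the difference is only readability; on large arrays the numpy version is usually much faster as well.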


In [3]:
# shape of the array as (rows, columns); the L suffix in the output below marks a Python 2 long integer
b = np.array([[1,2,3], [4,5,6]])
print(b.shape)

(2L, 3L)
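
A few related attributes are worth knowing alongside `shape`. This is a small sketch that reuses the same 2 x 3 array; `r` is a hypothetical name.

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])

print(b.ndim)    # number of dimensions
print(b.size)    # total number of elements

# reshape keeps the same six elements but arranges them as 3 rows x 2 columns
r = b.reshape(3, 2)
print(r.shape)

# .T swaps rows and columns (transpose)
print(b.T.shape)
```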


## pandas

- provides two main data structures, Series and DataFrame, both built on top of numpy
- Series: a one-dimensional labeled object, similar to an array, list, or single column
- DataFrame: a tabular data structure made of rows and columns, similar to a spreadsheet or database table; can be thought of as a group of Series objects that share an index
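
The two structures can be built by hand, which helps before loading a real file. The sketch below uses made-up titles and scores, not values from `ign.csv`.

```python
import pandas as pd

# A Series: one-dimensional values with an index (labels are made up)
scores = pd.Series([9.0, 8.5, 7.5], index=["game_a", "game_b", "game_c"])

# A DataFrame: a dict of equal-length columns becomes a table
df = pd.DataFrame({
    "title": ["Game A", "Game B", "Game C"],
    "score": [9.0, 8.5, 7.5],
})

# Selecting one column of a DataFrame gives back a Series
col = df["score"]
print(type(col).__name__)
print(df.shape)
```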

In [7]:
import pandas as pd

#read the CSV file into a dataframe (pandas can also read from other sources)
reviews = pd.read_csv("ign.csv")

In [8]:
#see first 5 rows
reviews.head()

|  | Unnamed: 0 | score_phrase | title | url | platform | score | genre | editors_choice | release_year | release_month | release_day |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Amazing | LittleBigPlanet PS Vita | /games/littlebigplanet-vita/vita-98907 | PlayStation Vita | 9.0 | Platformer | Y | 2012 | 9 | 12 |
| 1 | 1 | Amazing | LittleBigPlanet PS Vita -- Marvel Super Hero E... | /games/littlebigplanet-ps-vita-marvel-super-he... | PlayStation Vita | 9.0 | Platformer | Y | 2012 | 9 | 12 |
| 2 | 2 | Great | Splice: Tree of Life | /games/splice/ipad-141070 | iPad | 8.5 | Puzzle | N | 2012 | 9 | 12 |
| 3 | 3 | Great | NHL 13 | /games/nhl-13/xbox-360-128182 | Xbox 360 | 8.5 | Sports | N | 2012 | 9 | 11 |
| 4 | 4 | Great | NHL 13 | /games/nhl-13/ps3-128181 | PlayStation 3 | 8.5 | Sports | N | 2012 | 9 | 11 |


In [9]:
#can also see dataframe shape
reviews.shape

(18625, 11)

In [14]:
#and data types for each column
reviews.dtypes

Unnamed: 0          int64
score_phrase       object
title              object
url                object
platform           object
score             float64
genre              object
editors_choice     object
release_year        int64
release_month       int64
release_day         int64
dtype: object

In [15]:
#basic summary statistics for the dataframe's numeric columns
reviews.describe()

|  | Unnamed: 0 | score | release_year | release_month | release_day |
|---|---|---|---|---|---|
| count | 18625.0 | 18625.0 | 18625.0 | 18625.0 | 18625.0 |
| mean | 9312.0 | 6.950459 | 2006.515329 | 7.13847 | 15.603866 |
| std | 5376.718717 | 1.711736 | 4.587529 | 3.47671 | 8.690128 |
| min | 0.0 | 0.5 | 1970.0 | 1.0 | 1.0 |
| 25% | 4656.0 | 6.0 | 2003.0 | 4.0 | 8.0 |
| 50% | 9312.0 | 7.3 | 2007.0 | 8.0 | 16.0 |
| 75% | 13968.0 | 8.2 | 2010.0 | 10.0 | 23.0 |
| max | 18624.0 | 10.0 | 2016.0 | 12.0 | 31.0 |


The same statistics are available individually: `.min()`, `.max()`, `.mean()`, `.median()`, `.count()`, and `.std()` can be called on a single column (a Series) or on the whole dataframe.
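
As a quick illustration of per-column statistics, the sketch below uses a tiny hypothetical table in place of the reviews data. The values are invented so that the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical stand-in for the reviews table (values are made up)
toy = pd.DataFrame({
    "platform": ["PC", "PC", "iPad", "Xbox 360"],
    "score": [9.0, 6.5, 8.5, 7.0],
})

print(toy["score"].mean())       # average of the four scores
print(toy["score"].median())     # middle value of the sorted scores
print(toy["score"].max())        # highest score
print(toy["platform"].count())   # number of non-missing entries
```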

In [17]:
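#find the pairwise correlation between the numeric columns
#(pandas 2.0 and later need reviews.corr(numeric_only=True) when text columns are present)
reviews.corr()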

|  | Unnamed: 0 | score | release_year | release_month | release_day |
|---|---|---|---|---|---|
| Unnamed: 0 | 1.0 | 0.035579 | 0.893394 | -0.096676 | 0.010068 |
| score | 0.035579 | 1.0 | 0.062716 | 0.007632 | 0.020079 |
| release_year | 0.893394 | 0.062716 | 1.0 | -0.115515 | 0.016867 |
| release_month | -0.096676 | 0.007632 | -0.115515 | 1.0 | -0.067964 |
| release_day | 0.010068 | 0.020079 | 0.016867 | -0.067964 | 1.0 |
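
To see how to read a correlation table, it can help to try one on data where the answer is obvious. The sketch below uses invented columns and drops the text column explicitly with `select_dtypes`, which is one way to avoid depending on how a given pandas version treats non-numeric columns.

```python
import pandas as pd

# Made-up data: y rises with x, z falls as x rises
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [8, 6, 4, 2],
})

# Keep the numeric columns only, then compute pairwise Pearson correlations
corr = toy.select_dtypes(include="number").corr()
print(corr)
```

Values near 1 mean two columns rise together, values near -1 mean one falls as the other rises, and values near 0 mean no linear relationship.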


Selecting rows in a dataframe

In [18]:
#by position: rows 10 through 19 (the end of the slice is excluded)
some_reviews = reviews.iloc[10:20]
some_reviews

|  | Unnamed: 0 | score_phrase | title | url | platform | score | genre | editors_choice | release_year | release_month | release_day |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 10 | Good | Tekken Tag Tournament 2 | /games/tekken-tag-tournament-2/ps3-124584 | PlayStation 3 | 7.5 | Fighting | N | 2012 | 9 | 11 |
| 11 | 11 | Good | Tekken Tag Tournament 2 | /games/tekken-tag-tournament-2/xbox-360-124581 | Xbox 360 | 7.5 | Fighting | N | 2012 | 9 | 11 |
| 12 | 12 | Good | Wild Blood | /games/wild-blood/iphone-139363 | iPhone | 7.0 |  | N | 2012 | 9 | 10 |
| 13 | 13 | Amazing | Mark of the Ninja | /games/mark-of-the-ninja-135615/xbox-360-129276 | Xbox 360 | 9.0 | Action, Adventure | Y | 2012 | 9 | 7 |
| 14 | 14 | Amazing | Mark of the Ninja | /games/mark-of-the-ninja-135615/pc-143761 | PC | 9.0 | Action, Adventure | Y | 2012 | 9 | 7 |
| 15 | 15 | Okay | Home: A Unique Horror Adventure | /games/home-a-unique-horror-adventure/mac-2001... | Macintosh | 6.5 | Adventure | N | 2012 | 9 | 6 |
| 16 | 16 | Okay | Home: A Unique Horror Adventure | /games/home-a-unique-horror-adventure/pc-137135 | PC | 6.5 | Adventure | N | 2012 | 9 | 6 |
| 17 | 17 | Great | Avengers Initiative | /games/avengers-initiative/iphone-141579 | iPhone | 8.0 | Action | N | 2012 | 9 | 5 |
| 18 | 18 | Mediocre | Way of the Samurai 4 | /games/way-of-the-samurai-4/ps3-23516 | PlayStation 3 | 5.5 | Action, Adventure | N | 2012 | 9 | 3 |
| 19 | 19 | Good | JoJo's Bizarre Adventure HD | /games/jojos-bizarre-adventure/xbox-360-137717 | Xbox 360 | 7.0 | Fighting | N | 2012 | 9 | 3 |


In [24]:
#by label
#raises a KeyError on purpose: some_reviews kept its original labels (10 to 19), so there is no label 9
some_reviews.loc[9]

KeyError: 'the label [9] is not in the [index]'

In [20]:
#by position this works: position 9 is the tenth row of some_reviews (label 19)
some_reviews.iloc[9]

Unnamed: 0                                                    19
score_phrase                                                Good
title                                JoJo's Bizarre Adventure HD
url               /games/jojos-bizarre-adventure/xbox-360-137717
platform                                                Xbox 360
score                                                          7
genre                                                   Fighting
editors_choice                                                 N
release_year                                                2012
release_month                                                  9
release_day                                                    3
Name: 19, dtype: object
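
The difference between `.loc` (labels) and `.iloc` (positions) is easier to see on a very small table. The following sketch uses a hypothetical three-row dataframe whose labels start at 10, mimicking a slice like `some_reviews`.

```python
import pandas as pd

# Made-up scores; the index labels deliberately do not start at 0
toy = pd.DataFrame({"score": [7.5, 7.0, 9.0]}, index=[10, 11, 12])

first_by_position = toy.iloc[0]   # first row, whatever its label is
first_by_label = toy.loc[10]      # row whose index label is 10

# Label 0 does not exist, so .loc raises a KeyError
try:
    toy.loc[0]
    missing = False
except KeyError:
    missing = True
print(missing)

# reset_index(drop=True) renumbers the rows from 0, so labels match positions
renumbered = toy.reset_index(drop=True)
print(renumbered.loc[0, "score"])
```

Resetting the index is optional; it is only a convenience when the original labels no longer matter.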

In [23]:
#by condition: each test gives a True/False value per row; combine tests with & and wrap each one in parentheses
xbox_one_filter = (reviews["score"] > 7) & (reviews["platform"] == "Xbox One")
filtered_reviews = reviews[xbox_one_filter]
filtered_reviews.head()

|  | Unnamed: 0 | score_phrase | title | url | platform | score | genre | editors_choice | release_year | release_month | release_day |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 17137 | 17137 | Amazing | Gone Home | /games/gone-home/xbox-one-20014361 | Xbox One | 9.5 | Simulation | Y | 2013 | 8 | 15 |
| 17197 | 17197 | Amazing | Rayman Legends | /games/rayman-legends/xbox-one-20008449 | Xbox One | 9.5 | Platformer | Y | 2013 | 8 | 26 |
| 17295 | 17295 | Amazing | LEGO Marvel Super Heroes | /games/lego-marvel-super-heroes/xbox-one-20000826 | Xbox One | 9.0 | Action | Y | 2013 | 10 | 22 |
| 17313 | 17313 | Great | Dead Rising 3 | /games/dead-rising-3/xbox-one-124306 | Xbox One | 8.3 | Action | N | 2013 | 11 | 18 |
| 17317 | 17317 | Great | Killer Instinct | /games/killer-instinct-2013/xbox-one-20000538 | Xbox One | 8.4 | Fighting | N | 2013 | 11 | 18 |
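
Filters can be combined in other ways too. As a rough sketch on made-up rows (not the real reviews), the example below shows "and" (`&`), "or" (`|`), and membership in a list of values with `isin`.

```python
import pandas as pd

# Hypothetical rows standing in for the reviews table
toy = pd.DataFrame({
    "platform": ["Xbox One", "Xbox One", "PC", "PlayStation 4"],
    "score": [9.5, 6.0, 8.0, 7.5],
})

high_score = toy["score"] > 7                 # boolean Series, one value per row
on_xbox_one = toy["platform"] == "Xbox One"

both = toy[high_score & on_xbox_one]          # rows matching both tests
either = toy[high_score | on_xbox_one]        # rows matching at least one test
consoles = toy[toy["platform"].isin(["Xbox One", "PlayStation 4"])]

print(len(both), len(either), len(consoles))
```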


## matplotlib

In [25]:
from matplotlib import pyplot as plt
from matplotlib import style

style.use('ggplot')

x = [5,8,10]
y = [12,16,6]

x2 = [6,9,11]
y2 = [6,15,7]

In [31]:
#plot line graph
plt.plot(x,y,linewidth=5)
plt.plot(x2,y2,linewidth=5)

plt.title('Epic Info')
plt.ylabel('Y axis')
plt.xlabel('X axis')

plt.show()

In [27]:
#plot bar graph
plt.bar(x, y, align='center')
plt.bar(x2, y2, color='g', align='center')

plt.title('Epic Info')
plt.ylabel('Y axis')
plt.xlabel('X axis')

plt.show()

In [28]:
#plot scatterplot
plt.scatter(x, y)
plt.scatter(x2, y2, color='g')

plt.title('Epic Info')
plt.ylabel('Y axis')
plt.xlabel('X axis')

plt.show()
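
The cells above use the `pyplot` state-based interface. matplotlib also has an object-oriented interface, which tends to be clearer once a figure has more than one plot. Below is a minimal sketch using the same data; the `Agg` backend is chosen only so the code runs without opening a window, and the labels and titles are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

x = [5, 8, 10]
y = [12, 16, 6]
x2 = [6, 9, 11]
y2 = [6, 15, 7]

# One figure holding two side-by-side axes
fig, (ax_line, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Left panel: line plot with a legend built from the labels
ax_line.plot(x, y, label="series 1")
ax_line.plot(x2, y2, label="series 2")
ax_line.set_title("Line")
ax_line.legend()

# Right panel: scatter plot of the same points
ax_scatter.scatter(x, y, label="series 1")
ax_scatter.scatter(x2, y2, color="g", label="series 2")
ax_scatter.set_title("Scatter")
ax_scatter.legend()

fig.tight_layout()  # keep titles and tick labels from overlapping
```

In a notebook the figure is displayed as usual; outside one, `fig.savefig` with a file name of your choice writes it to disk.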

## Resources

http://www.python-course.eu/numpy.php

https://www.dataquest.io/blog/pandas-python-tutorial/

https://pythonprogramming.net/matplotlib-python-3-basics-tutorial/