# Introduction to Python and Jupyter Notebooks Review

To begin, be sure you understand how to move between cells in a Jupyter notebook and how to switch a cell between code and Markdown. If you want more practice styling Markdown cells, please see the [cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet). In this part of the notebook, we will review some NumPy basics and create some simple plots with Matplotlib.

In [1]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/T8JGn4JRy4g?ecver=1" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

In [2]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### NumPy and Matplotlib

To begin, let's play with some basic `matplotlib` plots and the NumPy random methods. For more information please consult the documentation [here](https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html). 
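
The random draws in this notebook change on every run. As a minimal sketch, assuming a reasonably recent NumPy that provides `np.random.default_rng`, a seeded generator makes the same kinds of draws reproducible. The variable names and the seed value here are illustrative choices, not part of the original notebook.

```python
import numpy as np

# A seeded generator returns the same draws every time the cell is run.
rng = np.random.default_rng(seed=0)  # the seed value is an arbitrary choice

ints = rng.integers(1, 20, size=100)             # integers in [1, 20), upper bound excluded
floats = rng.random(100)                         # floats in [0, 1)
normals = rng.normal(loc=5, scale=10, size=100)  # mean 5, standard deviation 10
binoms = rng.binomial(n=100, p=0.3, size=100)    # successes in 100 trials with p = 0.3

print(ints[:5])
print(round(normals.mean(), 2), round(normals.std(), 2))
```

With only 100 draws, the sample mean and standard deviation of `normals` will be near 5 and 10 but not equal to them.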

In [3]:
a = np.random.randint(1, 20, 100)

In [4]:
plt.figure()
plt.hist(a)

(array([ 9., 12., 11.,  6.,  7.,  6., 14., 17.,  8., 10.]),
 array([ 1. ,  2.8,  4.6,  6.4,  8.2, 10. , 11.8, 13.6, 15.4, 17.2, 19. ]),
 <a list of 10 Patch objects>)

In [5]:
b = np.random.random(100)
c = np.random.normal(5, 10, 100)
d = np.random.binomial(100, .3, 100)

In [6]:
np.random.binomial?

In [7]:
a[:5]

array([ 9,  5, 13,  1,  5])

In [8]:
plt.figure(figsize = (9, 6))

plt.subplot(2, 2, 1)
plt.hist(a)
plt.title("Random Integers")

plt.subplot(2, 2, 2)
plt.hist(b, color = 'green')
plt.title("Random Floats")

plt.subplot(2, 2, 3)
plt.hist(c, color = 'grey')
plt.title("Normal Distribution")

plt.subplot(2, 2, 4)
plt.hist(d, color = 'orange')
plt.title("Binomial Distribution")

Text(0.5,1,'Binomial Distribution')

In [8]:
plt.figure()
plt.scatter(c, d)
plt.title("Scatter Plot", loc = 'left')
plt.xticks([])
plt.yticks([])

([], <a list of 0 Text yticklabel objects>)

In [9]:
dists = [a, b, c, d]
plt.figure()
plt.boxplot(dists)
plt.title("Boxplots of Distributions", loc = "right")

Text(1,1,'Boxplots of Distributions')

In [10]:
import seaborn as sns

plt.figure()
for i in [a,c,d]:
    sns.kdeplot(i)  # density curve only; replaces the deprecated sns.distplot(i, hist = False)

### Loading Data: Intro to Pandas

Now we use the Pandas library to examine a variety of datasets. Below, I create four different `DataFrame` objects. The first three come from `.csv` files located in our **data** directory. The last comes from the NYC OpenData API. We will return to other ways of accessing and structuring data later, but to begin we use these two popular options.

To load the `.csv` files, we pass a path or URL to `pd.read_csv()`; the API data, which arrives as JSON, is loaded with `pd.read_json()`. I load all four datasets in what follows.
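
As a small sketch of the two loading routes, the example below reads CSV and JSON from in-memory text so it runs without any files or network access. The column names and values are made up for illustration; with real data you would pass a file path or URL instead of the `io.StringIO` object.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file in the data directory.
csv_text = """name,borough,count
a,BROOKLYN,3
b,QUEENS,5
c,BROOKLYN,2
"""
df_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical JSON records standing in for an API response.
json_text = '[{"name": "a", "count": 3}, {"name": "b", "count": 5}]'
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.shape)
print(df_json.dtypes)
```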

In [11]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/9Dsg9DQAU_g?ecver=1" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

In [12]:
nyc311data = pd.read_json('https://data.cityofnewyork.us/resource/fhrw-4uyv.json')

In [13]:
nyc311data.columns

Index(['address_type', 'agency', 'agency_name', 'bbl', 'borough', 'city',
       'closed_date', 'community_board', 'complaint_type', 'created_date',
       'cross_street_1', 'cross_street_2', 'descriptor', 'due_date',
       'facility_type', 'incident_address', 'incident_zip',
       'intersection_street_1', 'intersection_street_2', 'latitude',
       'location', 'location_type', 'longitude', 'open_data_channel_type',
       'park_borough', 'park_facility_name', 'resolution_action_updated_date',
       'resolution_description', 'status', 'street_name',
       'taxi_pick_up_location', 'unique_key', 'x_coordinate_state_plane',
       'y_coordinate_state_plane'],
      dtype='object')

In [14]:
nyc311data.dtypes

address_type                       object
agency                             object
agency_name                        object
bbl                               float64
borough                            object
city                               object
closed_date                        object
community_board                    object
complaint_type                     object
created_date                       object
cross_street_1                     object
cross_street_2                     object
descriptor                         object
due_date                           object
facility_type                      object
incident_address                   object
incident_zip                      float64
intersection_street_1              object
intersection_street_2              object
latitude                          float64
location                           object
location_type                      object
longitude                         float64
open_data_channel_type            

In [15]:
nyc311data.describe()

Unnamed: 0,bbl,incident_zip,latitude,longitude,unique_key,x_coordinate_state_plane,y_coordinate_state_plane
count,788.0,994.0,992.0,992.0,1000.0,992.0,992.0
mean,2821326000.0,10871.207243,40.733111,-73.914738,39546780.0,1007873.0,206390.224798
std,1204321000.0,546.764302,0.082205,0.079548,2405.281,22059.89,29947.421808
min,1000330000.0,10001.0,40.511559,-74.242685,39542680.0,916768.0,125745.0
25%,2028605000.0,10453.0,40.675169,-73.956432,39544620.0,996336.0,185261.0
50%,3026485000.0,11209.5,40.719661,-73.92055,39546700.0,1006240.0,201471.0
75%,4016903000.0,11364.75,40.801983,-73.866021,39548840.0,1021327.0,231463.25
max,5073550000.0,11694.0,40.907142,-73.729944,39550840.0,1059177.0,269790.0


In [16]:
complaints = nyc311data[['complaint_type', 'borough', 'agency', 'agency_name']]

In [17]:
complaints.head()

Unnamed: 0,complaint_type,borough,agency,agency_name
0,Request Large Bulky Item Collection,BROOKLYN,DSNY,Department of Sanitation
1,Request Large Bulky Item Collection,MANHATTAN,DSNY,Department of Sanitation
2,Noise - Residential,QUEENS,NYPD,New York City Police Department
3,Noise - Residential,QUEENS,NYPD,New York City Police Department
4,Noise - Residential,QUEENS,NYPD,New York City Police Department


In [18]:
complaints.groupby(by = 'borough').size()

borough
BRONX            156
BROOKLYN         298
MANHATTAN        200
QUEENS           299
STATEN ISLAND     42
Unspecified        5
dtype: int64
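
Counting rows per group can be written two ways. The sketch below uses a small made-up frame, not the 311 data, to contrast `groupby(...).size()`, which orders the result by group label, with `value_counts()`, which orders it by frequency.

```python
import pandas as pd

# Toy data; the boroughs and complaint types are illustrative only.
toy = pd.DataFrame({
    "complaint_type": ["Noise", "Noise", "Parking", "Noise", "Parking", "Noise"],
    "borough": ["QUEENS", "BROOKLYN", "BROOKLYN", "QUEENS", "BRONX", "QUEENS"],
})

by_group = toy.groupby("borough").size()   # index sorted alphabetically by borough
by_count = toy["borough"].value_counts()   # sorted by count, largest first

print(by_group)
print(by_count)
```

The second form is the one used later to build `BK_COMPLAIN`, where having the most frequent complaint types first makes slicing the top few easy.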

In [19]:
complaints[complaints['borough'] =='BROOKLYN'].sort_values('complaint_type')[:10]

Unnamed: 0,complaint_type,borough,agency,agency_name
79,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
554,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
386,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
191,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
931,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
284,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
272,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
392,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
448,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
712,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department


In [20]:
BK_COMPLAIN = complaints[complaints['borough'] == 'BROOKLYN']['complaint_type'].value_counts()

In [21]:
plt.figure(figsize = (7, 5))
plt.bar(BK_COMPLAIN.index[:6], BK_COMPLAIN[:6])

<BarContainer object of 6 artists>

In [23]:
plt.tick_params(labelrotation = 20)

In [22]:
plt.rcParams["font.family"] = "fantasy"

plt.figure(figsize = (10, 7))
bars = plt.barh(BK_COMPLAIN.index[:5], BK_COMPLAIN[:5])
plt.title("Top 5 311 Complaints in Brooklyn", loc = 'left', fontsize = 16 )

Text(0,1,'Top 5 311 Complaints in Brooklyn')

In [24]:
labels = BK_COMPLAIN.index

In [25]:
for i in labels[:6]:
    print(i)

Noise - Residential
Noise - Street/Sidewalk
Noise - Commercial
Blocked Driveway
Illegal Parking
Noise - Vehicle


In [28]:
for i in range(5):
    label = labels[i]
    plt.gca().text(2, i, label, color = 'w', fontsize = 10)

In [29]:
plt.tick_params(top = False, bottom = False, left = False, right = False, labelleft = False, labelbottom = False)



In [30]:
for spine in plt.gca().spines.values():
    spine.set_visible(False)

In [32]:
plt.savefig('images/brooklyn_complaining.png')
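
The styling steps above are spread over several cells that all act on the current figure. As a hedged sketch, the same sequence can be written in one block against an explicit `Axes` object. The counts are invented for illustration, and the `Agg` backend is an assumption made so the code runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for running as a script
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts, not the real 311 numbers.
toy_counts = pd.Series({"Noise - Residential": 40, "Blocked Driveway": 25, "Illegal Parking": 15})

fig, ax = plt.subplots(figsize=(10, 7))
bars = ax.barh(toy_counts.index, toy_counts.values)
ax.set_title("Top complaints (toy data)", loc="left", fontsize=16)

# Write each category name inside its bar instead of on the axis.
for i, label in enumerate(toy_counts.index):
    ax.text(2, i, label, color="w", fontsize=10, va="center")

# Hide the ticks, the tick labels, and the frame around the plot.
ax.tick_params(top=False, bottom=False, left=False, right=False,
               labelleft=False, labelbottom=False)
for spine in ax.spines.values():
    spine.set_visible(False)
```

Working through `ax` rather than `plt.gca()` keeps every call tied to the same plot even if another figure is created in between.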

### Titanic Manipulation

In [33]:
titanic = pd.read_csv('data/eda_data/titanic.csv')
titanic.head()

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [32]:
titanic[titanic.pclass == 3][:5]

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


In [33]:
titanic.sample(frac=0.1)[:5]

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
429,1,3,"Pickard, Mr. Berk (Berk Trembisky)",male,32.0,0,0,SOTON/O.Q. 392078,8.05,E10,S
219,0,2,"Harris, Mr. Walter",male,30.0,0,0,W/C 14208,10.5,,S
567,0,3,"Palsson, Mrs. Nils (Alma Cornelia Berglund)",female,29.0,0,4,349909,21.075,,S
571,1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0,2,0,11769,51.4792,C101,S
333,0,3,"Vander Planke, Mr. Leo Edmondus",male,16.0,2,0,345764,18.0,,S


In [34]:
titanic.iloc[4:10]

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
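
The cells above mix three ways of selecting rows: a boolean mask, plain slicing, and `.iloc`. The sketch below uses a tiny made-up frame whose index labels deliberately differ from the row positions, to show how position-based `.iloc` and label-based `.loc` diverge.

```python
import pandas as pd

# Toy frame; index labels 10..14 are chosen so they do not match positions 0..4.
people = pd.DataFrame(
    {"pclass": [3, 1, 3, 1, 2], "age": [22.0, 38.0, 26.0, 35.0, 54.0]},
    index=[10, 11, 12, 13, 14],
)

by_position = people.iloc[1:3]               # positions 1 and 2; the end is excluded
by_label = people.loc[11:13]                 # labels 11, 12 and 13; the end is included
third_class = people[people["pclass"] == 3]  # boolean mask keeps rows where the test is True

print(by_position.index.tolist())
print(by_label.index.tolist())
print(third_class.index.tolist())
```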


In [35]:
titanic.nlargest(10, 'age')

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
630,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
851,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S
96,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
493,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
116,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
672,0,2,"Mitchell, Mr. Henry Michael",male,70.0,0,0,C.A. 24580,10.5,,S
745,0,1,"Crosby, Capt. Edward Gifford",male,70.0,1,1,WE/P 5735,71.0,B22,S
33,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5,,S
54,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C
280,0,3,"Duane, Mr. Frank",male,65.0,0,0,336439,7.75,,Q


In [36]:
titanic.nsmallest(10, 'age')

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
803,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C
755,1,2,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5,,S
469,1,3,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C
644,1,3,"Baclini, Miss. Eugenie",female,0.75,2,1,2666,19.2583,,C
78,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0,,S
831,1,2,"Richards, Master. George Sibley",male,0.83,1,1,29106,18.75,,S
305,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S
164,0,3,"Panula, Master. Eino Viljami",male,1.0,4,1,3101295,39.6875,,S
172,1,3,"Johnson, Miss. Eleanor Ileen",female,1.0,1,1,347742,11.1333,,S
183,1,2,"Becker, Master. Richard F",male,1.0,2,1,230136,39.0,F4,S


In [37]:
gender = titanic[['survived', 'sex']]

In [38]:
gender[gender['survived'] == 0].groupby('sex').size()

sex
female     81
male      468
dtype: int64

In [62]:
# A boolean mask with .loc selects the rows for passengers who did not survive
gender.loc[gender['survived'] == 0].head()

In [39]:
gender[gender['survived'] == 1].groupby('sex').size()

sex
female    233
male      109
dtype: int64
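
The two counts above are easier to compare as rates. This sketch rebuilds a small table from the four numbers printed in the outputs and divides survivors by the group total; the table layout and column names are my own.

```python
import pandas as pd

# Counts copied from the two groupby outputs above.
survival_counts = pd.DataFrame(
    {"died": [81, 468], "survived": [233, 109]},
    index=["female", "male"],
)

survival_counts["total"] = survival_counts["died"] + survival_counts["survived"]
survival_counts["survival_rate"] = survival_counts["survived"] / survival_counts["total"]

print(survival_counts.round(3))
```

Because `survived` is coded as 0 or 1, `gender.groupby('sex')['survived'].mean()` on the full frame should give the same rates in a single step.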

### Rock Songs

In [34]:
rockin = pd.read_csv('data/eda_data/rocking.csv', index_col = 0)

In [35]:
rockin.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2230 entries, 0 to 2229
Data columns (total 8 columns):
Song Clean      2230 non-null object
ARTIST CLEAN    2230 non-null object
Release Year    1653 non-null object
COMBINED        2230 non-null object
First?          2230 non-null int64
Year?           2230 non-null int64
PlayCount       2230 non-null int64
F*G             2230 non-null int64
dtypes: int64(4), object(4)
memory usage: 156.8+ KB


In [36]:
rockin.head()

Unnamed: 0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,First?,Year?,PlayCount,F*G
0,Caught Up in You,.38 Special,1982.0,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980.0,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1


In [46]:
# Rename the awkward column names; 'Release Year' keeps its name because the cells below still refer to it
rockin = rockin.rename({'First?': 'First', 'Year?': 'Year', 'F*G': 'fg', 'Song Clean': 'song', 'ARTIST CLEAN': 'artist', 'COMBINED': 'song_artist'}, axis = 1)

In [47]:
null_release_mask = rockin['Release Year'].isnull()
rockin.loc[null_release_mask, 'Release Year'] = 0
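
The mask-and-assign pattern above replaces missing years with 0, but `rockin.info()` reports the column as `object`, so the values may still be strings. A minimal sketch on made-up rows, assuming the years are stored as text, shows the same fill followed by a conversion to integers. The new column name `release_year_int` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows; the song titles and years are illustrative only.
songs = pd.DataFrame({
    "song": ["A", "B", "C"],
    "Release Year": ["1982", np.nan, "1975"],
})

# Same pattern as the cell above: flag the missing values, then overwrite them.
null_mask = songs["Release Year"].isnull()
songs.loc[null_mask, "Release Year"] = 0

# Convert to numbers; errors="coerce" turns any unparseable text into NaN,
# which is then filled with 0 as well.
songs["release_year_int"] = (
    pd.to_numeric(songs["Release Year"], errors="coerce").fillna(0).astype(int)
)

print(songs)
```

Using 0 as a placeholder keeps the column numeric, but it will distort plots and summary statistics unless those rows are filtered out first.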

In [44]:
rockin.head()

Unnamed: 0,song,artist,Release Year,song_artist,First,Year,PlayCount,fg
0,Caught Up in You,.38 Special,1982,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,0,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975,Art For Arts Sake by 10cc,1,1,1,1


In [49]:
rockin['artist'].unique()[::10]

array(['.38 Special', 'Alannah Myles', 'Arlo Guthrie', 'Badfinger',
       'Billy Squier', 'Bob Dylan', 'Bruce Hornsby & The Range',
       'Charlie Daniels Band', 'Counting Crows', 'Dave Mason',
       'Derek & The Dominos', 'Don McLean', 'Edgar Winter Group',
       'Eurythmics', 'Fleetwood Mac', 'Genesis', "Gov't Mule",
       'Harold Faltermeyer', 'Hooters', 'James Gang', 'Jethro Tull',
       'John Fogerty', 'Junkyard', 'Lake', 'Local H', 'Meat Loaf',
       'Molly Hatchet', 'Neil Young', 'Ozark Mountain Daredevils',
       'Peter Gabriel', 'Queensryche', 'Rick Derringer', 'Roger Daltry',
       'Santana', 'Skid Row', 'Squeeze', 'Steve Winwood', 'Stu Nunnery',
       'Taxxi', "The B-52's", 'The Clash', 'The Kinks', 'The Outlaws',
       'The Traveling Wilburys', 'Tom Cochrane', 'U2', 'Warren Zevon',
       'Y&T'], dtype=object)

In [50]:
rockin.hist()

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x1a21fba860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a22077a20>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1a220ae2b0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a220e7da0>]],
      dtype=object)

In [55]:
rockin.count()

song            2230
artist          2230
Release Year    2230
song_artist     2230
First           2230
Year            2230
PlayCount       2230
fg              2230
dtype: int64

In [56]:
rockin.Year

0       1
1       0
2       1
3       1
4       1
5       1
6       1
7       1
8       1
9       1
10      0
11      1
12      1
13      0
14      1
15      1
16      0
17      1
18      1
19      1
20      1
21      1
22      1
23      1
24      0
25      0
26      0
27      1
28      1
29      0
       ..
2200    1
2201    1
2202    1
2203    0
2204    1
2205    0
2206    0
2207    1
2208    0
2209    1
2210    1
2211    1
2212    1
2213    1
2214    1
2215    0
2216    0
2217    1
2218    0
2219    1
2220    1
2221    0
2222    0
2223    1
2224    1
2225    0
2226    1
2227    1
2228    1
2229    1
Name: Year, Length: 2230, dtype: int64

In [65]:
rockin.groupby('artist')['song'].count()

artist
.38 Special                  4
10cc                         1
3 Doors Down                 3
4 Non Blondes                1
AC/DC                       29
Ace                          1
Adelitas Way                 1
Aerosmith                   31
Alanis Morissette            2
Alannah Myles                1
Aldo Nova                    1
Alice Cooper                10
Alice In Chains              6
Allman Brothers Band        13
America                      3
Animals II                   2
Ann Wilson                   1
April Wine                   1
Argent                       1
Arlo Guthrie                 1
Artful Dodger                1
Asia                         2
Atlanta Rhythm Section       2
Audioslave                   1
Autograph                    2
Axe                          1
Bachman-Turner Overdrive     4
Bad Company                 14
Bad English                  1
Badfinger                    1
                            ..
Trapeze                      1
T

In [79]:
# Total play count per artist
rockin[['artist','PlayCount']].groupby('artist').sum()

In [80]:
# The column is still named 'Release Year'; count the songs in each release year
rockin.groupby('Release Year').size()
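
To go one step past the grouped sums, the sketch below ranks artists by total plays and also collects a song count and a play total in one pass with named aggregation. The rows are invented for illustration, and the output column names `songs` and `plays` are my own.

```python
import pandas as pd

# Toy play counts; not taken from the rock songs file.
plays = pd.DataFrame({
    "artist": ["AC/DC", "AC/DC", "Aerosmith", "10cc"],
    "PlayCount": [10, 5, 7, 1],
})

# Total plays per artist, largest first.
totals = plays.groupby("artist")["PlayCount"].sum().sort_values(ascending=False)

# Several summaries at once: number of songs and total plays per artist.
summary = plays.groupby("artist").agg(
    songs=("PlayCount", "size"),
    plays=("PlayCount", "sum"),
)

print(totals.head(2))
print(summary)
```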