In [2]:
# This lecture covers DataFrames, DevOps and GitHub

In [3]:
# https://www.youtube.com/watch?v=_I94-tJlovg

In [4]:
# DevOps means writing code with the intention of deploying it. In practice, code is written in shorter chunks, using templates
# that help manage it.

# Long story short, DevOps engineers help developers (e.g. data scientists) and IT work together.

In [5]:
# In Agile-style development, the product evolves in small increments over an indefinite period.
# Projects like this need some form of version control, and Git is the tool most commonly used for it.

In [6]:
# https://towardsdatascience.com/devops-for-data-scientists-taming-the-unicorn-6410843990de
# DevOps, testing and security are all important topics to be aware of in the data science space.

In [7]:
#https://github.com/luciasantamaria/pandas-tutorial/blob/master/intro-to-pandas-1-Data-Structures.ipynb
#Let's look at Pandas.

Introduction to pandas
Adapted from original materials by Greg Reda.

What is it?
pandas is an open source Python library for data analysis. Python has always been great for prepping and munging data, but historically it has not been great for analysis - you'd usually end up using R or loading it into a database and using SQL (or worse, Excel). pandas makes Python great for analysis.

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
# Show up to 50 columns before switching to the truncated view
pd.set_option('display.max_columns', 50)
%matplotlib inline

Data Structures
pandas introduces two new data structures to Python, Series and DataFrame, both of which are built on top of NumPy (which is what makes them fast).
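
To make the NumPy connection concrete, here is a minimal sketch. The sample values and the names `values`, `arr` and `roots` are made up for illustration and are not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, not taken from the tutorial
values = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

# The numeric data in a Series can be viewed as a NumPy array
arr = values.to_numpy()
print(type(arr))

# NumPy functions work elementwise on a Series and keep its index
roots = np.sqrt(values)
print(roots)
```

Because the heavy lifting happens in NumPy, operations like `np.sqrt` run over the whole Series at once instead of looping in Python.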

Series
A Series is a one-dimensional object similar to an array, list, or column in a table. It will assign a labeled index to each item in the Series. By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.

# create a Series with an arbitrary list
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'])
s
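
As a quick check of the default labelling described above, the sketch below builds the same kind of Series without an index argument and inspects the labels pandas assigned. The variable name `default_indexed` is only illustrative.

```python
import pandas as pd

# Build a Series from a mixed-type list without passing an index
default_indexed = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'])

# With no index given, pandas labels the items 0 .. len - 1
print(list(default_indexed.index))

# Mixed types are stored with the generic object dtype
print(default_indexed.dtype)
```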

In [9]:
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'],
              index=['A', 'Z', 'C', 'Y', 'E'])
s

A                7
Z       Heisenberg
C             3.14
Y      -1789710578
E    Happy Eating!
dtype: object

In [10]:
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(d)
cities

Chicago          1000.0
New York         1300.0
Portland          900.0
San Francisco    1100.0
Austin            450.0
Boston              NaN
dtype: float64

In [11]:
cities['Chicago']

1000.0

In [12]:
cities[['Chicago', 'Portland', 'San Francisco']]

Chicago          1000.0
Portland          900.0
San Francisco    1100.0
dtype: float64

In [13]:
cities[cities < 1000]

Portland    900.0
Austin      450.0
dtype: float64

In [14]:
less_than_1000 = cities < 1000
print(less_than_1000)
print('\n')
print(cities[less_than_1000])

Chicago          False
New York         False
Portland          True
San Francisco    False
Austin            True
Boston           False
dtype: bool


Portland    900.0
Austin      450.0
dtype: float64
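
Note that Boston is False in the mask above: comparisons against NaN evaluate to False. The sketch below shows this on a small made-up Series (the name `sample` and its values are assumptions for illustration) and also how two conditions might be combined.

```python
import pandas as pd

# Hypothetical sample; None becomes NaN in a float Series
sample = pd.Series({'Chicago': 1000, 'Portland': 900, 'Austin': 450, 'Boston': None})

# NaN compared with anything is False, so Boston is False in both masks
below = sample < 1000
at_least = sample >= 1000
print(below['Boston'], at_least['Boston'])

# Combine conditions with & or | and parentheses, not `and` / `or`
mid_range = sample[(sample > 500) & (sample < 1000)]
print(mid_range)
```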


In [15]:
# changing based on the index
print('Old value:', cities['Chicago'])
cities['Chicago'] = 1400
print('New value:', cities['Chicago'])

Old value: 1000.0
New value: 1400.0


In [16]:
cities

Chicago          1400.0
New York         1300.0
Portland          900.0
San Francisco    1100.0
Austin            450.0
Boston              NaN
dtype: float64

In [17]:
# changing values using boolean logic
print(cities[cities < 1000])
print('\n')
cities[cities < 1000] = 750

print(cities[cities < 1000])

Portland    900.0
Austin      450.0
dtype: float64


Portland    750.0
Austin      750.0
dtype: float64


In [18]:
print('Seattle' in cities)
print('San Francisco' in cities)

False
True
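
The `in` operator on a Series tests the index labels, not the stored values. A minimal sketch, using a made-up two-item Series named `sample`:

```python
import pandas as pd

# Hypothetical sample, not the tutorial's full cities Series
sample = pd.Series({'Chicago': 1400.0, 'Austin': 750.0})

# Membership is checked against the labels
label_found = 'Chicago' in sample

# A stored value is not found this way, because only labels are searched
value_found_via_in = 1400.0 in sample

# To test the values, look in the underlying array instead
value_found = 1400.0 in sample.values

print(label_found, value_found_via_in, value_found)
```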


In [20]:
# divide city values by 3
cities/3

Chicago          466.666667
New York         433.333333
Portland         250.000000
San Francisco    366.666667
Austin           250.000000
Boston                  NaN
dtype: float64

In [21]:
# square city values
np.square(cities)

Chicago          1960000.0
New York         1690000.0
Portland          562500.0
San Francisco    1210000.0
Austin            562500.0
Boston                 NaN
dtype: float64

In [22]:
print(cities[['Chicago', 'New York', 'Portland']])
print('\n')
print(cities[['Austin', 'New York']])
print('\n')
print(cities[['Chicago', 'New York', 'Portland']] + cities[['Austin', 'New York']])

Chicago     1400.0
New York    1300.0
Portland     750.0
dtype: float64


Austin       750.0
New York    1300.0
dtype: float64


Austin         NaN
Chicago        NaN
New York    2600.0
Portland       NaN
dtype: float64
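
The NaN values above appear because `+` aligns the two Series on their labels, and a label present on only one side has nothing to add to. If treating the missing side as zero is acceptable, one option is `Series.add` with `fill_value`. The names `left`, `right` and `total` below are illustrative, with values copied from the output above.

```python
import pandas as pd

left = pd.Series({'Chicago': 1400.0, 'New York': 1300.0, 'Portland': 750.0})
right = pd.Series({'Austin': 750.0, 'New York': 1300.0})

# Plain + aligns on labels; labels found on only one side give NaN
aligned = left + right
print(aligned)

# Series.add with fill_value substitutes 0 for the missing side
total = left.add(right, fill_value=0)
print(total)
```

Whether zero is the right substitute depends on what the numbers mean, so treat this as one possible choice and not a default.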


In [23]:
# returns a boolean series indicating which values aren't NULL
print(cities)
cities.notnull()

Chicago          1400.0
New York         1300.0
Portland          750.0
San Francisco    1100.0
Austin            750.0
Boston              NaN
dtype: float64


Chicago           True
New York          True
Portland          True
San Francisco     True
Austin            True
Boston           False
dtype: bool

In [24]:
# use boolean logic to grab the NULL cities
print(cities.isnull())
print('\n')
print(cities[cities.isnull()])

Chicago          False
New York         False
Portland         False
San Francisco    False
Austin           False
Boston            True
dtype: bool


Boston   NaN
dtype: float64
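
Once the missing entries have been located, two common follow-ups are dropping them or replacing them. The sketch below uses a small made-up Series (`sample`) and is only one way to handle this. The replacement value 0 is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical sample with one missing value
sample = pd.Series({'Chicago': 1400.0, 'Austin': 750.0, 'Boston': None})

# dropna returns a new Series without the missing entries
present = sample.dropna()
print(present)

# fillna returns a new Series with missing entries replaced
filled = sample.fillna(0)
print(filled)
```

Both methods return new objects, so `sample` itself still contains the NaN afterwards.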


DataFrame
A DataFrame is a tabular data structure composed of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can also think of a DataFrame as a group of Series objects, one per named column, that share an index.
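
The "group of Series" view can be sketched directly: pass a dictionary of Series to the constructor and each one becomes a column. The names `wins`, `losses` and `record`, and the year labels used as the index, are assumptions for this illustration.

```python
import pandas as pd

# Two Series with the same index labels (hypothetical sample data)
wins = pd.Series([11, 8, 10], index=[2010, 2011, 2012])
losses = pd.Series([5, 8, 6], index=[2010, 2011, 2012])

# Each Series becomes a column; the shared index labels the rows
record = pd.DataFrame({'wins': wins, 'losses': losses})
print(record)

# Selecting a single column gives back a Series
print(type(record['wins']))
```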

For the rest of the tutorial, we'll be primarily working with DataFrames.

Reading Data
To create a DataFrame out of common Python data structures, we can pass a dictionary of lists to the DataFrame constructor.

Using the columns parameter allows us to tell the constructor how we'd like the columns ordered. Without it, current pandas versions keep the dictionary's insertion order, while older versions (before 0.23) sorted the columns alphabetically. Neither rule applies when reading from a file (more on that next).

In [32]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])
football

   year     team  wins  losses
0  2010    Bears    11       5
1  2011    Bears     8       8
2  2012    Bears    10       6
3  2011  Packers    15       1
4  2012  Packers    11       5
5  2010    Lions     6      10
6  2011    Lions    10       6
7  2012    Lions     4      12
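
The columns argument used above does more than reorder: it can also pick a subset of the dictionary's keys. A minimal sketch with a cut-down, made-up version of the data (`small`, `ordered` and `subset` are illustrative names):

```python
import pandas as pd

# Hypothetical two-row sample in a deliberately scrambled key order
small = {'wins': [11, 8], 'team': ['Bears', 'Bears'], 'year': [2010, 2011]}

# columns controls the left-to-right order of the result
ordered = pd.DataFrame(small, columns=['year', 'team', 'wins'])
print(ordered)

# Listing only some keys keeps just those columns
subset = pd.DataFrame(small, columns=['team', 'year'])
print(subset)
```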
