This command imports the class Counter from the collections library. Counter is a very useful tool for data scientists; it can count the number of times items appear in collections such as lists. For example, in the code below we will create a list of marriage ages. Using Counter we can quickly count the number of times each unique age appears.

In [1]:
from collections import Counter

In [2]:
from collections import Counter
marriage_ages = [22, 22, 25, 25, 30, 24, 26, 24, 35]  # create a list
value_counts = Counter(marriage_ages)  # apply the counter functionality
print(value_counts.most_common())

[(22, 2), (25, 2), (24, 2), (30, 1), (26, 1), (35, 1)]


In [3]:
def add_two_numbers(x, y):  # function header
    """
    Takes in two numbers and returns the sum
    parameters
        x : str
            first number
        y : str
            second number
    returns
        x+y
    """
    z = x + y
    return z  # function return
print(add_two_numbers(100,5))  # function  call

105


# List


In [4]:
# This creates the list
depths = [1, 5, 3, 6, 4, 7, 10, 12]

# This outputs the first 5 elements. No number before the : implies 0
first_5_depths = depths[:5]

print("---0---")
print(first_5_depths)

# You can easily sum
print("---1---")
print(sum(depths))

# And take the max
print("---2---")
print(max(depths))

# Slicing with a negative starts from the end, so this returns the last element
print("---3---")
print(depths[-1])

# This returns the end of the list starting from the second to the end
# Nothing after the : implies the end of the list
print("---4---")
print(depths[-2:])

# This returns the second, third, and forth elements
# Remember counting starts at zero!
print("---5---")
print(depths[2:5])

# These commands check if a value is contained in the list
print("---6---")
print(22 in depths)
print(1 in depths)

# This is how you add another value to the end of your list
depths.append(44)
print("---7---")
print(depths)

# You can extend a list with another list
depths.extend([100, 200])
print("---8---")
print(depths)

# You can also modify a value
# This replaces the 4th value with 100
depths[4] = 100
print("---9---")
print(depths)

# Or you can do insert to accomplish the same thing
depths.insert(5, 1000)
print("---10---")
print(depths)

---0---
[1, 5, 3, 6, 4]
---1---
48
---2---
12
---3---
12
---4---
[10, 12]
---5---
[3, 6, 4]
---6---
False
True
---7---
[1, 5, 3, 6, 4, 7, 10, 12, 44]
---8---
[1, 5, 3, 6, 4, 7, 10, 12, 44, 100, 200]
---9---
[1, 5, 3, 6, 100, 7, 10, 12, 44, 100, 200]
---10---
[1, 5, 3, 6, 100, 1000, 7, 10, 12, 44, 100, 200]


# Dictionary

In [5]:
# Initialize the dictionary.
# Keys are first then a : then the value
my_dict = {"age": 22, "birth_year": 1999, "name": "jack", "siblings": ["jill", "jen"]}

# Get the value for the key age
print("---0---")
print(my_dict['age'])

# Check is age is a key
print("---1---")
print('age' in my_dict)

# Check is company is a key
print("---2---")
print('company' in my_dict)

# Get the value for they key age
print("---3---")
print(my_dict.get('age'))

# Get the value for they key company
# If it doesn't exsist, return 1
print("---4---")
print(my_dict.get('company', 1))

# Return all the keys
print("---5---")
print(my_dict.keys())

# Return all the values
print("---6---")
print(my_dict.values())

# Return all the key, value pairs
print("---7---")
print(my_dict.items())

---0---
22
---1---
True
---2---
False
---3---
22
---4---
1
---5---
dict_keys(['age', 'birth_year', 'name', 'siblings'])
---6---
dict_values([22, 1999, 'jack', ['jill', 'jen']])
---7---
dict_items([('age', 22), ('birth_year', 1999), ('name', 'jack'), ('siblings', ['jill', 'jen'])])


# Sets in Python 
Sets are another useful data type. Sets are an unordered collection of unique elements, which means any duplicates are automatically removed. Sets allow you to do operations like union, intersection, and difference. Here’s an example:

In [6]:
my_set = set()
my_set.add(1)
my_set.add(2)
my_set.add(1)
# Note that the set only contains a single 1 value
print("---0---")
print(my_set)

my_set2 = set()
my_set2.add(1)
my_set2.add(2)
my_set2.add(3)
my_set2.add(4)
print("---1---")
print(my_set2)

# Prints the overlap
print("---1---")
print(my_set.intersection(my_set2))
print("---2---")

# Prints the combination
print(my_set.union(my_set2))

# Prints the difference (those in my_set but not my_set2)
print("---3---")
print(my_set.difference(my_set2))

---0---
{1, 2}
---1---
{1, 2, 3, 4}
---1---
{1, 2}
---2---
{1, 2, 3, 4}
---3---
set()


# Control structures in Python 
The if-else construct #

In [7]:
def age_check(age):

    if age > 40:  # if age greater than 40, print "Older than 40"
        print("Older than 40")
    elif age > 30 and age <= 40: # if age greater than 30 and less than or equal to 40, print "Between 30 and 40"
        print("Between 30 and 40")
    else:  # if neither of the previous conditions are met, print "Other"
        print("Other")

print(age_check(41))

Older than 40
None


# The for construct 
Loops allow you to iterate over an iterable. That’s not a very helpful definition, so let’s consider the most common use case, lists. A loop allows you to iterate over a list or other data types that also allow iteration.

You can contain your iterable in the enumerate() command to add a counter to your loop. This is useful if you want to loop over a list of values while still having access to the iterable index.

In [8]:
names = ['tyler', 'karen', 'jill']   # list containing names

for i, name in enumerate(names):     # iterating over names
    print("Index: {0}".format(i))    # printing index number
    print("Value: {0}".format(name)) # print the value at the index

Index: 0
Value: tyler
Index: 1
Value: karen
Index: 2
Value: jill


# The zip function 
Lastly, a slightly more complicated function: zip. You can do a lot of useful things with zip, but here are 2 common use cases:

Combining two lists into a list of tuples
Breaking a tuple into two lists
Combining two lists into a tuple 
The zip actually returns a generator, so we have to wrap it in list() to print it. This would not be necessary if you wanted to loop over it though, because generators are iterable:

In [9]:
list_1 = [1, 2, 3]  # create your first list
list_2 = ['x', 'y', 'z']  # create your second list

print(list(zip(list_1, list_2)))  #combine and print

[(1, 'x'), (2, 'y'), (3, 'z')]


In [10]:
pairs = [('x', 1), ('y', 2), ('z', 3)]  # a list of tuples
letters, numbers = zip(*pairs)  # break into two lists

print(letters)  # print the first values of the tuples
print(numbers)  # print the second values of the tuples

('x', 'y', 'z')
(1, 2, 3)


# Handling arrays with NumPy 
Single dimensional arrays

In [11]:
import numpy as np

# This creates our array
np_array = np.array([5, 10, 15, 20, 25, 30])
print("--0--")

# Gets the unique values
print(np.unique(np_array))
print("--1--")

# Calculates the standard deviation
print(np.std(np_array))
print("--2--")

# Calculates the maximum
print(np_array.max())
print("--3--")

# Squares each value in the array
print(np_array ** 2)
print("--4--")

# Adds the arrays together element wise
print(np_array + np_array)
print("--5--")

# The sum of the squares of the elements
print(np.sum(np_array ** 2))
print("--6--")

# Gives you the shape: (rows, columns)
print(np_array.shape)

--0--
[ 5 10 15 20 25 30]
--1--
8.539125638299666
--2--
30
--3--
[ 25 100 225 400 625 900]
--4--
[10 20 30 40 50 60]
--5--
2275
--6--
(6,)


In [12]:
import numpy as np

# Create 2d array
print("--0--")
np_2d_array = np.array([[1,2,3], 
                        [4,5,6]])
print(np_2d_array)

# Calculate the transpose, which is when you swap the columns and rows.
print("--1--")
np_2d_array_T = np_2d_array.T
print(np_2d_array_T)

# Print the shape of the array as (number of rows, number of columns)
print("--3--")
print(np_2d_array.shape)

# Access elements in the 2d array by index. 
# First index is the row number
# Second index is the column number
# Index numbers start from 0
print("--4--")
print(np_2d_array[1,1])
print(np_2d_array[0,2])

--0--
[[1 2 3]
 [4 5 6]]
--1--
[[1 4]
 [2 5]
 [3 6]]
--3--
(2, 3)
--4--
5
3


# Sampling the data 
Once you have some data, it can be useful to sample from it.

As a refresher, sampling is a way to take a smaller group from a population. For example, you might create a random sample from the U.S. population randomly knocking on 10 doors in the U.S. That would be a sample size of 10.

The choice() function allows you to pass an array, specify how many values to sample, and decide whether sampling should be done with or without replacement. Sampling without replacement means the same value can’t be sampled more than once.

In [13]:
import numpy as np
array = np.array([1,2,3,4,5])

# Sample 10 data points with replacement. 
print("--0--")
print(np.random.choice(array, 10, replace=True))

# Sample 3 data points without replacement. 
print("--1--")
print(np.random.choice(array, 3, replace=False))

--0--
[2 2 4 3 3 4 2 1 4 5]
--1--
[2 5 4]


In [14]:
import numpy as np

x = [1,2,3,4,5]  # Create a list of 5 elements
np.random.shuffle(x)  # Randomly shuffle the order of the elements in the list

print(x)

[3, 2, 5, 4, 1]


# Scipy an External Library
This lesson introduces an external scipy library by discussing in detail how scipy provides support to handle statistics and probabilistic functionalities.

# Calculating correlations 
Scipy is a Python library for scientific computing. Scipy and Numpy are the core libraries that Pandas is built upon. We will discuss Pandas later in the course, but having an understanding of Scipy and Numpy before discussing Pandas is useful.

A correlation is a numerical measure of the statistical relationship between two variables. For us, those variables will usually be two columns of data, for example, the temperature outside and the likelihood of rain.

One way to calculate the correlation between two vectors of data is with Pearson’s r-value. This value ranges between -1 and 1. Where -1 means there is a total negative correlation, 0 means no correlation, and 1 means total positive.

In [15]:
from scipy import stats
import numpy as np

array_1 = np.array([1,2,3,4,5,6])  # Create a numpy array from a list
array_2 = array_1  # Create another array with the same values

print(stats.pearsonr(array_1, array_2))  # Calculate the correlation which will be 1 since the values are the same 

PearsonRResult(statistic=0.9999999999999999, pvalue=1.8488927466117464e-32)


In [16]:
from scipy import stats

x = stats.norm.rvs(loc=0, scale=10, size=10)  # Generate 10 values randomly sampled from a normal distribution with mean 0 and standard deviation of 10

print(x)

[  2.4080289    8.74384029  -0.97225648   3.18329968  16.18033781
 -11.89214579  -6.19223903  -4.18226201  -0.1299328    5.28423092]


In [17]:
from scipy import stats

p1 = stats.norm.pdf(x=-100, loc=0, scale=10)  # Get probability of sampling a value of -100
p2 = stats.norm.pdf(x=0, loc=0, scale=10)     # Get probability of sampling a value of 0

print(p1)
print(p2)

7.69459862670642e-24
0.03989422804014327


In [18]:
from scipy import stats

print(stats.describe(stats.norm.rvs(loc=0, scale=1, size=500)))  # Calculate descriptive statistics for 500 data points sampled from normal distribution with mean 0 and standard deviation of 1

DescribeResult(nobs=500, minmax=(-2.8725880003088275, 2.8665288954726305), mean=0.023191363856166755, variance=1.014112664629023, skewness=-0.05153834576173098, kurtosis=-0.21087812976765408)


# creating csv files

In [19]:
import pandas as pd

# Define the column names as a list
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
# Read in the CSV file from the webpage using the defined column names
df = pd.read_csv("adult.data", header=None, names=names)
                      
print(df.head())

   age          workclass  fnlwgt   education  educationnum  \
0   39          State-gov   77516   Bachelors            13   
1   50   Self-emp-not-inc   83311   Bachelors            13   
2   38            Private  215646     HS-grad             9   
3   53            Private  234721        11th             7   
4   28            Private  338409   Bachelors            13   

         maritalstatus          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capitalgain  capitalloss  hoursperweek   nativecountry   label  
0         2174            0            40   United-States   <=50K  
1         

In [20]:
df.shape

(32561, 15)

# Introduction to JSON file 
JSON (JavaScript Object Notation) is a popular format allowing for a more flexible schema. It is also easy for humans to read and write. A lot of the data sent around the web is transmitted as JSON. Here is an example:

In [21]:
import json

## Define the JSON object as a string
json_string = """{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}"""


# Read the JSON data into Python
json_data = json.loads(json_string)

print(json_data)

{'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': {'GlossEntry': {'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}}}}}


In [22]:
#with open('data.json') as f:
 #   data = json.load(f)

# Introduction to raw files 
Sometimes you get data in strange formats and you have to roll your own Python code to process the data. Fortunately, doing this is simple.

For this, we will assume that you have data in some type of text file. Each row of data corresponds to a row in your text file.

For example, you might have a file delimited by a pipe (|). It could look something like this:

In [23]:
###import tempfile

#tmp = tempfile.NamedTemporaryFile()

# Open the file for writing. And write the data.
#with open(tmp.name, 'w') as f:
   # f.write("James|22|M\n")
  #  f.write("Sarah|31|F\n")
 #   f.write("Mindy|25|F")

# Read in the data from our file, line by line
#with open(tmp.name, "r") as f:
   # for line in f:
        #             print(line)
###

In [24]:
import pandas as pd
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("adult.data", header=None, names=names)
print(train_df.head())

   age          workclass  fnlwgt   education  educationnum  \
0   39          State-gov   77516   Bachelors            13   
1   50   Self-emp-not-inc   83311   Bachelors            13   
2   38            Private  215646     HS-grad             9   
3   53            Private  234721        11th             7   
4   28            Private  338409   Bachelors            13   

         maritalstatus          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capitalgain  capitalloss  hoursperweek   nativecountry   label  
0         2174            0            40   United-States   <=50K  
1         

In [25]:
import pandas as pd
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("adult.data", header=None, names=names)
print(train_df.describe().T)

                count           mean            std      min       25%  \
age           32561.0      38.581647      13.640433     17.0      28.0   
fnlwgt        32561.0  189778.366512  105549.977697  12285.0  117827.0   
educationnum  32561.0      10.080679       2.572720      1.0       9.0   
capitalgain   32561.0    1077.648844    7385.292085      0.0       0.0   
capitalloss   32561.0      87.303830     402.960219      0.0       0.0   
hoursperweek  32561.0      40.437456      12.347429      1.0      40.0   

                   50%       75%        max  
age               37.0      48.0       90.0  
fnlwgt        178356.0  237051.0  1484705.0  
educationnum      10.0      12.0       16.0  
capitalgain        0.0       0.0    99999.0  
capitalloss        0.0       0.0     4356.0  
hoursperweek      40.0      45.0       99.0  


In [26]:
import pandas as pd

# Read in data as explained in reading CSV lesson
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("adult.data", header=None, names=names)
                      
print(train_df.info())  # Use the info() function on the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   age            32561 non-null  int64 
 1   workclass      32561 non-null  object
 2   fnlwgt         32561 non-null  int64 
 3   education      32561 non-null  object
 4   educationnum   32561 non-null  int64 
 5   maritalstatus  32561 non-null  object
 6   occupation     32561 non-null  object
 7   relationship   32561 non-null  object
 8   race           32561 non-null  object
 9   sex            32561 non-null  object
 10  capitalgain    32561 non-null  int64 
 11  capitalloss    32561 non-null  int64 
 12  hoursperweek   32561 non-null  int64 
 13  nativecountry  32561 non-null  object
 14  label          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
None


# Converting data types 
If a column doesn’t seem to have the correct type, it is easy to convert it to different types using .to_() functions:

to_numeric()
to_datetime()
to_string()
For example:

df['numeric_column'] = pd.to_numeric(df['string_column'])#

In [27]:
# check for the unique value 
print(train_df['relationship'].unique())

[' Not-in-family' ' Husband' ' Wife' ' Own-child' ' Unmarried'
 ' Other-relative']


In [28]:
import pandas as pd
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("adult.data", header=None, names=names)
print(train_df['relationship'].value_counts())

 Husband           13193
 Not-in-family      8305
 Own-child          5068
 Unmarried          3446
 Wife               1568
 Other-relative      981
Name: relationship, dtype: int64


# Grouping the data 
We can also do these types of counts by specific groups by using the groupby() function. This function takes a list of columns by which you would like to group your dataframe. It then performs the requested calculations on each group individually and returns the results by group. Here is an example:

In [29]:
import pandas as pd
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("adult.data", header=None, names=names)

# Group by relationship and then get the value counts of label with normalization                   
print(train_df.groupby('relationship')['label'].value_counts(normalize=True))

relationship     label 
 Husband          <=50K    0.551429
                  >50K     0.448571
 Not-in-family    <=50K    0.896930
                  >50K     0.103070
 Other-relative   <=50K    0.962283
                  >50K     0.037717
 Own-child        <=50K    0.986780
                  >50K     0.013220
 Unmarried        <=50K    0.936738
                  >50K     0.063262
 Wife             <=50K    0.524872
                  >50K     0.475128
Name: label, dtype: float64


In [30]:

print(train_df.groupby(['workclass'])['hoursperweek'].mean())

workclass
 ?                   31.919390
 Federal-gov         41.379167
 Local-gov           40.982800
 Never-worked        28.428571
 Private             40.267096
 Self-emp-inc        48.818100
 Self-emp-not-inc    44.421881
 State-gov           39.031587
 Without-pay         32.714286
Name: hoursperweek, dtype: float64


# finding the correlation 
Another useful statistic is the correlation. If you need a refresher on correlation, please check out Wikipedia. You can calculate all the pair-wise correlations in your dataframe by using the corr function.

1234567


In [31]:


# Calculate correlations                   
print(train_df.corr())

  print(train_df.corr())


                   age    fnlwgt  educationnum  capitalgain  capitalloss  \
age           1.000000 -0.076646      0.036527     0.077674     0.057775   
fnlwgt       -0.076646  1.000000     -0.043195     0.000432    -0.010252   
educationnum  0.036527 -0.043195      1.000000     0.122630     0.079923   
capitalgain   0.077674  0.000432      0.122630     1.000000    -0.031615   
capitalloss   0.057775 -0.010252      0.079923    -0.031615     1.000000   
hoursperweek  0.068756 -0.018768      0.148123     0.078409     0.054256   

              hoursperweek  
age               0.068756  
fnlwgt           -0.018768  
educationnum      0.148123  
capitalgain       0.078409  
capitalloss       0.054256  
hoursperweek      1.000000  



We can quickly see that compared to all of the correlations, there is a higher correlation between “hours per week” and “education num”, but it is not very high. You will notice, though since our label is an object, it isn’t included here. Knowing how variables correlate with our label would be useful, so let’s take care of that:

In [32]:

# Convert the string label into a value of 1 when >= 50k and 0 otherwise
train_df['label_int'] = train_df.label.apply(lambda x: ">" in x)
print(train_df.corr())

  print(train_df.corr())


                   age    fnlwgt  educationnum  capitalgain  capitalloss  \
age           1.000000 -0.076646      0.036527     0.077674     0.057775   
fnlwgt       -0.076646  1.000000     -0.043195     0.000432    -0.010252   
educationnum  0.036527 -0.043195      1.000000     0.122630     0.079923   
capitalgain   0.077674  0.000432      0.122630     1.000000    -0.031615   
capitalloss   0.057775 -0.010252      0.079923    -0.031615     1.000000   
hoursperweek  0.068756 -0.018768      0.148123     0.078409     0.054256   
label_int     0.234037 -0.009463      0.335154     0.223329     0.150526   

              hoursperweek  label_int  
age               0.068756   0.234037  
fnlwgt           -0.018768  -0.009463  
educationnum      0.148123   0.335154  
capitalgain       0.078409   0.223329  
capitalloss       0.054256   0.150526  
hoursperweek      1.000000   0.229689  
label_int         0.229689   1.000000  


In [33]:

# Use the describe function to calculate the percentiles specified                     
print(train_df.describe(percentiles=[.01,.05,.95,.99]))

                age        fnlwgt  educationnum   capitalgain   capitalloss  \
count  32561.000000  3.256100e+04  32561.000000  32561.000000  32561.000000   
mean      38.581647  1.897784e+05     10.080679   1077.648844     87.303830   
std       13.640433  1.055500e+05      2.572720   7385.292085    402.960219   
min       17.000000  1.228500e+04      1.000000      0.000000      0.000000   
1%        17.000000  2.718580e+04      3.000000      0.000000      0.000000   
5%        19.000000  3.946000e+04      5.000000      0.000000      0.000000   
50%       37.000000  1.783560e+05     10.000000      0.000000      0.000000   
95%       63.000000  3.796820e+05     14.000000   5013.000000      0.000000   
99%       74.000000  5.100720e+05     16.000000  15024.000000   1980.000000   
max       90.000000  1.484705e+06     16.000000  99999.000000   4356.000000   

       hoursperweek  
count  32561.000000  
mean      40.437456  
std       12.347429  
min        1.000000  
1%         8.000000 

 percentile is the value below which a given percent of the data falls.

# Pivot Table 
Somewhat like Excel, we can pivot our data using pandas pivot_table functionality. To do so, we will use the pivot_table() function.

The values parameter is the column being used for aggregation, the index parameter is for the index values that creates multiple rows, and the columns parameter is for the value on which you want to have multiple columns created.

You can also use the aggfunc parameter to pass a function with which to aggregate your pivots.

Let’s look at an example:

In [34]:
import numpy as np
import pandas as pd


# Pivot the data frame to show by relationship, workclass (rows) and label (columns) the average hours per week.
print(pd.pivot_table(train_df, values='hoursperweek', index=['relationship','workclass'], 
               columns=['label'], aggfunc=np.mean).round(2))

label                               <=50K   >50K
relationship    workclass                       
 Husband         ?                  30.72  37.33
                 Federal-gov        42.34  43.05
                 Local-gov          41.40  44.56
                 Private            42.50  46.18
                 Self-emp-inc       48.29  50.49
                 Self-emp-not-inc   46.01  48.07
                 State-gov          38.67  45.17
                 Without-pay        34.25    NaN
 Not-in-family   ?                  31.29  39.44
                 Federal-gov        40.60  47.54
                 Local-gov          40.38  45.01
                 Never-worked       35.00    NaN
                 Private            40.20  47.03
                 Self-emp-inc       49.06  53.58
                 Self-emp-not-inc   41.53  45.02
                 State-gov          38.87  44.19
 Other-relative  ?                  29.10  40.00
                 Federal-gov        38.40  45.00
                 Loc

# Cross Tab 
Crosstab is a nice way to get frequency tables. What you do is pass two columns to the function and you will get the frequency of all the pair-wise combinations of those two variables.

Let’s look at an example using label and relationship as our columns:

In [35]:


# Calculate the frequencies between label and relationship
print(pd.crosstab(train_df['label'], train_df.relationship))

relationship   Husband   Not-in-family   Other-relative   Own-child  \
label                                                                 
 <=50K            7275            7449              944        5001   
 >50K             5918             856               37          67   

relationship   Unmarried   Wife  
label                            
 <=50K              3228    823  
 >50K                218    745  


In [36]:

# Crosstab with normalized outputs
print(pd.crosstab(train_df['label'], train_df.relationship, normalize=True))

relationship   Husband   Not-in-family   Other-relative   Own-child  \
label                                                                 
 <=50K        0.223427        0.228771         0.028992    0.153589   
 >50K         0.181751        0.026289         0.001136    0.002058   

relationship   Unmarried      Wife  
label                               
 <=50K          0.099137  0.025276  
 >50K           0.006695  0.022880  


# Reshape 
With Pandas, you can use pivot() to reshape your data. To illustrate this concept, I will use code from this post to create a dataframe in a long format.


In [37]:
import pandas.util.testing as tm; tm.N = 3
import numpy as np
import pandas as pd 

def unpivot(frame):
    N, K = frame.shape
    data = {'value' : frame.values.ravel('F'),
            'variable' : np.asarray(frame.columns).repeat(N),
            'date' : np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=['date', 'variable', 'value'])
df = unpivot(tm.makeTimeDataFrame())
print(df)

  import pandas.util.testing as tm; tm.N = 3


          date variable     value
0   2000-01-03        A  1.347667
1   2000-01-04        A  0.722874
2   2000-01-05        A -0.704584
3   2000-01-06        A  0.337961
4   2000-01-07        A -0.590463
..         ...      ...       ...
115 2000-02-07        D -0.004090
116 2000-02-08        D -0.985388
117 2000-02-09        D  1.226312
118 2000-02-10        D  0.532080
119 2000-02-11        D  1.556491

[120 rows x 3 columns]


In [40]:
import pandas.util.testing as tm; tm.N = 3
import numpy as np
import pandas as pd 

def unpivot(frame):
    N, K = frame.shape
    data = {'value' : frame.values.ravel('F'),
            'variable' : np.asarray(frame.columns).repeat(N),
            'date' : np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=['date', 'variable', 'value'])
df = unpivot(tm.makeTimeDataFrame())

# Use pivot to keep date as the index and value as the values, but use the vaiable column to create new columns
df_pivot = df.pivot(index='date', columns='variable', values='value')
print(df_pivot)

variable           A         B         C         D
date                                              
2000-01-03  0.145465  1.275622 -0.796562  0.028705
2000-01-04  1.842369  1.057237 -0.357448  0.411317
2000-01-05  1.491366 -0.371615  0.935339  0.207586
2000-01-06  0.502395 -1.210836 -1.726363  1.547884
2000-01-07 -0.970034 -1.122811 -1.392785 -0.064032
2000-01-10  0.676547 -0.930433  0.082083 -0.398931
2000-01-11  0.708579  0.442614  0.245391 -1.182064
2000-01-12 -0.357984 -0.884836 -0.845117 -1.115405
2000-01-13  0.022101  1.412799  0.594949  0.574728
2000-01-14 -0.467741 -0.427137  0.927292  0.017947
2000-01-17  0.579834 -1.066233  3.110033 -1.071741
2000-01-18  0.274805 -0.560597  0.927202 -2.048019
2000-01-19 -1.514235  0.319917 -0.988648  0.280133
2000-01-20  1.625991 -2.357232 -0.356871 -0.954610
2000-01-21 -0.093331  0.196612  0.889723 -0.497676
2000-01-24 -0.286103  0.244241  0.476519 -0.106592
2000-01-25  0.152012 -0.399965  0.772984  0.707590
2000-01-26 -0.305735 -0.916677 

In [42]:
import pandas.util.testing as tm; tm.N = 3
import numpy as np
import pandas as pd 

# Create long dataframe
def unpivot(frame):
    N, K = frame.shape
    data = {'value' : frame.values.ravel('F'),
            'variable' : np.asarray(frame.columns).repeat(N),
            'date' : np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=['date', 'variable', 'value'])
df = unpivot(tm.makeTimeDataFrame())

# Convert to wide format
df_pivot = df.pivot(index='date', columns='variable', values='value')

# Convert back to long format
print(df_pivot.unstack())

variable  date      
A         2000-01-03    0.492193
          2000-01-04   -0.363407
          2000-01-05   -0.345795
          2000-01-06    1.137429
          2000-01-07    0.303698
                          ...   
D         2000-02-07   -0.540596
          2000-02-08    0.631036
          2000-02-09    1.373140
          2000-02-10   -0.945360
          2000-02-11   -0.391222
Length: 120, dtype: float64


In [50]:
df.columns

Index(['date', 'variable', 'value'], dtype='object')

In [None]:
# import pandas as pd

# # Loading dataset
# def read_csv():
#     # Define the column names as a list
    
#     # Read in the CSV file from the webpage using the defined column names
#     df = pd.read_csv("adult.data", header=None, names=names, delim_whitespace=True)
#     return df

# # Describing data
# def group_aggregation(df, group_var, agg_var):

#     # Grouping the data and taking mean
#     grouped_df = df.groupby([group_var])[agg_var].mean()
#     return grouped_df

# # Calling the function
# print(group_aggregation(read_csv(),"capitalgain","hoursperweek"))

Types of scaling 
Standard scaling 
Standard scaling subtracts the mean and divides by the standard deviation. This centers the feature on zero with unit variance.

In [58]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

# Create a matrix of data
data = [[-1, 2], 
        [-0.5, 6], 
        [0, 10], 
        [1, 18]]

print("Before Standard scaling")
print(np.mean(data, 0))
print(np.std(data, 0))

# Initalize a StandardScaler
standard = StandardScaler()
# Fit and transform the data with the StandardScaler
standard_data = standard.fit_transform(data)

print("After Standard scaling")
print(np.mean(standard_data, 0))
print(np.std(standard_data, 0))

Before Standard scaling
[-0.125  9.   ]
[0.73950997 5.91607978]
After Standard scaling
[0. 0.]
[1. 1.]


In the example above, we created a NumPy array of shape (4,2). We then use the StandardScaler() from sklearn which will automatically subtract the mean and divide by the standard deviation of each of our columns. This is done with the fit_transform() call.

We check that it worked by printing the mean and standard deviation. We can see that both columns now have a mean of 0 and a standard deviation of 1.

# Min/Max scaling 
Let’s look at the same example but instead use the MinMaxScaler() from sklearn.

In [59]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Create matrix of data
data = [[-1, 2], 
        [-0.5, 6], 
        [0, 10], 
        [1, 18]]

# Initalize MinMaxScaler
min_max = MinMaxScaler()
# Fit and transform the data
min_max_data = min_max.fit_transform(data)

print(np.min(min_max_data, 0))
print(np.max(min_max_data, 0))
print(np.mean(min_max_data, 0))
print(np.std(min_max_data, 0))

[0. 0.]
[1. 1.]
[0.4375 0.4375]
[0.36975499 0.36975499]


# Introduction to categorical data 
Sometimes you get categorical data which are variables with a limited and usually fixed number of values. For example, male and female. Machine learning algorithms need numbers to work, so how do you deal with these? We will discuss two ways:

Label encoding

One-hot encoding a.k.a. dummy variables.

# Dealing with categorical data 
Label encoding
 
Label encoding works by converting the unique values to a numeric representation. For example, if we have two categories male and female, we can categorize them as numbers:

male as 0
female 1

In [60]:
import pandas as pd

# Create series with male and female values
non_categorical_series = pd.Series(['male', 'female', 'male', 'female'])
# Convert the text series to a categorical series
categorical_series = non_categorical_series.astype('category')
# Print the numeric codes for each value
print(categorical_series.cat.codes)
# Print the category names
print(categorical_series.cat.categories)

0    1
1    0
2    1
3    0
dtype: int8
Index(['female', 'male'], dtype='object')


# One-hot encoding 

One-hot encoding is similar but creates a new column for each category and fills it with a 1 for each row with that value and zero otherwise.

In [61]:
import pandas as pd

# Create series with male and female values
non_categorical_series = pd.Series(['male', 'female', 'male', 'female'])
# Create dummy or one-hot encoded variables
print(pd.get_dummies(non_categorical_series))

   female  male
0       0     1
1       1     0
2       0     1
3       1     0


We see that we just had to use the get_dummies() call on our series and it automatically makes new columns for each unique value in our series and fills them with a 1 or 0 as appropriate.

Now that you have a grasp on some ways of cleaning our data, the next lesson brings you a challenge to solve.

In [None]:
# import pandas as pd

# def read_csv():
#     # Define the column names as a list
#     names = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year", "origin", "car_name"]
#     # Read in the CSV file from the webpage using the defined column names
#     df = pd.read_csv("auto-mpg.data", header=None, names=names, delim_whitespace=True)
#     return df

# # Remving outliers from the data
# def outlier_detection(df):
#     df = df.quantile([.90, .10])
#     return df

# print(outlier_detection(read_csv()))

# The importance of data visualization 
So far, we have looked at understanding data via descriptive statistics and tables. Another useful tool is visualization.

Visualizations of data can provide the following benefits:

A better understanding of the data

A more compelling story when explaining the data

An easier to comprehend medium

In [None]:
# from sklearn.datasets import load_boston
# import pandas as pd

# # Load the boston dataset from sklearn.datasets
# boston_data = load_boston()
# # Enter the boston data into a dataframe
# boston_df = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)

# # Print the first 5 rows to confirm ran correctly
# # print(boston_df.head())