# Python Fundamentals for Tuesday, September 19


Follow along as we work our way through this document. Use the code cells to test and experiment as we go. Try to complete all of the practice exercises and ask questions if you need assistance.

This is a list of the topics we will cover today, starting where we left off on Tuesday:

1. Importing files
2. Managing Data with Pandas
3. Using regular expressions

# Files

When we want to read or write a file (say on your hard drive), we first must open the file. Opening the file communicates with your operating system, which knows where the data for each file is stored. When you open a file, you are asking the operating system to find the file by name and make sure the file exists. In this example, we open the file mbox.txt, which should be stored in the same folder that you are in when you start Python. You can download this file from www.py4e.com/code3/mbox.txt

To break the file into lines, there is a special character that represents the “end of the line” called the newline character.

In Python, we represent the newline character as a backslash-n in string constants. Even though this looks like two characters, it is actually a single character. When we look at the variable by entering “stuff” in the interpreter, it shows us the \n in the string, but when we use print to show the string, we see the string broken into two lines by the newline character.
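The behavior described above can be sketched as follows. The variable name `stuff` comes from the text; the string contents are an illustrative assumption.

```python
# Illustrative string; the contents are an assumption, not taken from mbox.txt
stuff = 'Hello\nWorld!'

# repr() shows what the interpreter echoes: the newline appears as \n
print(repr(stuff))

# print() interprets the newline and breaks the output into two lines
print(stuff)

# The backslash-n pair counts as a single character
short = 'X\nY'
print(len(short))
```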

While the file handle does not contain the data for the file, it is quite easy to construct a for loop to read through and count each of the lines in a file:

In [47]:
fhand = open('../data/mbox.txt')
count = 0
for line in fhand:
    count += 1
print("Line Count:", count)

Line Count: 132045


We can use the file handle as the sequence in our for loop. Our for loop simply counts the number of lines in the file, and the print statement then displays the total. The rough translation of the for loop into English is, “for each line in the file represented by the file handle, add one to the count variable.”

The reason that the open function does not read the entire file is that the file might be quite large with many gigabytes of data. The open statement takes the same amount of time regardless of the size of the file. The for loop actually causes the data to be read from the file.

If you know the file is relatively small compared to the size of your main memory, you can read the whole file into one string using the read method on the file handle.

In [48]:
fhand = open("../data/mbox.txt", 'r')
inp = fhand.read()
print(len(inp))
print(inp[:20])

6687002
From stephen.marquar


In [49]:
fhand = open('../data/mbox.txt')
count = 0
for line in fhand:
    line = line.rstrip()
    if line.startswith('From'):
        print(line)

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From: stephen.marquard@uct.ac.za
From louis@media.berkeley.edu Fri Jan  4 18:10:48 2008
From: louis@media.berkeley.edu
From zqian@umich.edu Fri Jan  4 16:10:39 2008
From: zqian@umich.edu
From rjlowe@iupui.edu Fri Jan  4 15:46:24 2008
From: rjlowe@iupui.edu
From zqian@umich.edu Fri Jan  4 15:03:18 2008
From: zqian@umich.edu
From rjlowe@iupui.edu Fri Jan  4 14:50:18 2008
From: rjlowe@iupui.edu
From cwen@iupui.edu Fri Jan  4 11:37:30 2008
From: cwen@iupui.edu
From cwen@iupui.edu Fri Jan  4 11:35:08 2008
From: cwen@iupui.edu
From gsilver@umich.edu Fri Jan  4 11:12:37 2008
From: gsilver@umich.edu
From gsilver@umich.edu Fri Jan  4 11:11:52 2008
From: gsilver@umich.edu
From zqian@umich.edu Fri Jan  4 11:11:03 2008
From: zqian@umich.edu
From gsilver@umich.edu Fri Jan  4 11:10:22 2008
From: gsilver@umich.edu
From wagnermr@iupui.edu Fri Jan  4 10:38:42 2008
From: wagnermr@iupui.edu
From zqian@umich.edu Fri Jan  4 10:17:43 2008
From: zqian@

The open function takes two arguments, the filename and the mode. There are four possible modes, with read being the default when no mode is specified. 

* "r" - Read - Default value. Opens a file for reading, error if the file does not exist

* "a" - Append - Opens a file for appending, creates the file if it does not exist

* "w" - Write - Opens a file for writing, creates the file if it does not exist, erases the contents if it does

* "x" - Create - Creates the specified file, returns an error if the file exists

When you are searching through data in a file, it is a very common pattern to read through a file, ignoring most of the lines and only processing lines which meet a particular condition. We can combine the pattern for reading a file with string methods to build simple search mechanisms.

For example, if we wanted to read a file and only print out lines which started with the prefix “From:”, we could use the string method `startswith()` to select only those lines with the desired prefix:

Each of the lines ends with a newline, so the print statement prints the string in the variable line which includes a newline and then print adds another newline, resulting in the double spacing effect we see.

We could use line slicing to print all but the last character, but a simpler approach is to use the rstrip method which strips whitespace from the right side of a string as follows:
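A minimal sketch of the search pattern with `rstrip()` and `startswith()`. To keep it runnable without mbox.txt, it reads from a hypothetical in-memory stand-in (`io.StringIO`) whose three lines are made up for illustration; a real file handle from `open()` can be looped over in the same way.

```python
import io

# Hypothetical in-memory stand-in for a file handle (assumption, not the real mbox.txt)
sample = io.StringIO(
    "From: stephen.marquard@uct.ac.za\n"
    "Subject: example\n"
    "From: louis@media.berkeley.edu\n"
)

matches = []
for line in sample:
    line = line.rstrip()            # drop the trailing newline and other trailing whitespace
    if line.startswith('From:'):    # keep only lines with the desired prefix
        matches.append(line)
        print(line)                 # no double spacing, because the newline is gone
```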

To write a file, you have to open it with mode “w” as a second parameter. If the file already exists, opening it in write mode clears out the old data and starts fresh, so be careful! If the file doesn’t exist, a new one is created.

In [50]:
#fout = open("output.txt", "w")

The write method of the file handle object puts data into the file, returning the number of characters written. The default write mode is text for writing (and reading) strings.

When you are done writing, you have to close the file to make sure that the last bit of data is physically written to the disk so it will not be lost if the power goes off.
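The write-then-close sequence can be sketched as below. The file location is a hypothetical path inside a temporary folder (an assumption, chosen so that no real file is overwritten), and the text written is made up for illustration.

```python
import os
import tempfile

# Hypothetical output location in a temporary folder (assumption)
out_path = os.path.join(tempfile.mkdtemp(), 'output.txt')

first_line = 'line one\n'
fout = open(out_path, 'w')           # mode 'w' creates the file or clears an existing one
n_written = fout.write(first_line)   # write() returns the number of characters written
fout.close()                         # make sure the data reaches the disk

# A with block closes the file automatically, even if an error occurs
with open(out_path, 'a') as fout:    # mode 'a' appends instead of clearing
    fout.write('line two\n')

with open(out_path) as fin:
    contents = fin.read()
print(n_written)
print(contents)
```

The `with` form is often preferred because the close step cannot be forgotten.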

# Pandas Fundamentals 

`pandas` is an open-source Python library that provides data structures and data analysis tools for working with structured data. It is widely used in data science, data analysis, and data manipulation tasks. Pandas is built on top of the NumPy library and provides easy-to-use data structures such as Series and DataFrame, which are designed to handle and manipulate data efficiently. `pandas` is well suited for many different kinds of data:

* Tabular data with heterogeneously-typed columns, as in an Excel spreadsheet

* Ordered and unordered (not necessarily fixed-frequency) time series data.

* Matrix data (homogeneously typed or heterogeneous) with row and column labels

* Any observational / statistical data sets.


The two primary data structures of pandas are Series (1-dimensional) and DataFrame (2-dimensional). A DataFrame does essentially everything that R’s `data.frame` does and then some. `pandas` is built on top of NumPy and integrates well with many other 3rd party libraries.


## Pandas Data Frames

You can think of a data frame as the Python analog to an Excel spreadsheet; it is a simple table with an effectively **unlimited** number of rows and columns (bounded only by available memory). Typically the rows of a data frame will reference the observational unit while columns will reference variables describing each observational unit. In a data frame the information is **related** across both rows and columns.

### Creating Data Frames

Data frames are created in Python using the pandas `DataFrame()` function. To call the `DataFrame()` function from the pandas package you should type something similar to `pd.DataFrame()`, where the prefix can vary depending on how you have imported pandas. When a dictionary is passed as the `data` parameter, its keys are used as the column names in the data frame and its values become the variables of the data frame (columns).

Use the code below to create your own data frame. The code initially creates a data frame with 4 observations of 3 variables; however, you should edit this to create data frames of varying dimensionality. The name that has been assigned to this particular data frame is df, which you will find is a common 'pythonic' naming convention. The variables have been named X, Y, and Z.

In [51]:
import pandas as pd  #load the pandas package and call it pd

df = (
    pd.DataFrame(data={ 
        'X': [1, 2, 3, 4],
        'Y': [5, 3, 2, 1],
        'Z': [4, 2, 1, 7]}))

In [52]:
whos

Variable    Type             Data/Info
--------------------------------------
count       int              0
df          DataFrame           X  Y  Z\n0  1  5  4\n1<...>2\n2  3  2  1\n3  4  1  7
df1         DataFrame           John  Paul  George  Ri<...>  13    26      52    104
df_         DataFrame           John  Paul  George  Ri<...>  13    26      52    104
df_empty    DataFrame        Empty DataFrame\nColumns: []\nIndex: []
fhand       TextIOWrapper    <_io.TextIOWrapper name='<...>ode='r' encoding='UTF-8'>
inp         str              From stephen.marquard@uct<...>ce > Preferences.\n\n\n\n
line        str              
pd          module           <module 'pandas' from 'c:<...>es\\pandas\\__init__.py'>
wine_path   str              ../data/WineData.csv
wines       DataFrame             Class label  Alcohol<...>\n[178 rows x 14 columns]
wines1      DataFrame             Class label  Alcohol<...>\n[178 rows x 14 columns]
z_var       Series           0    4\n1    2\n2    1\n3<...> 7\

In [53]:
df

   X  Y  Z
0  1  5  4
1  2  3  2
2  3  2  1
3  4  1  7


We see that both a `DataFrame` and `module` have been created. You can use `?` to check them out. Other packages that you will frequently use in this course are numpy and scipy. As a **quick aside** we can call for those to load as well and inspect them to see how they might be used.

In [54]:
pd?

Type:        module
String form: <module 'pandas' from 'c:\\Users\\carso\\anaconda3\\lib\\site-packages\\pandas\\__init__.py'>
File:        c:\users\carso\anaconda3\lib\site-packages\pandas\__init__.py
Docstring:
pandas - a powerful data analysis and manipulation library for Python

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.

Main Features
-------------
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating point as well as non-floating
    point 

A full list of the packages included in the Anaconda distribution is [here](https://docs.anaconda.com/anaconda/packages/pkg-docs/). Should you need to access a package that is not pre-loaded into Anaconda you can almost certainly install it via the Conda prompt. We will work through some examples together later in the course, but for now we will be satisfied with the modules already accessible.

**Returning to our data frame**, note that the name and the assignment appear on one line of code, ending in an open parenthesis. This allows the code to be continued over multiple lines. The name of the function, here `pd.DataFrame()`, is on a separate line and it is indented, as is the name for each column. This nesting is not necessary, but it makes the data structure clear to the reader. The functions and methods being employed are spotlighted by this approach.

The *data parameter* is given inside of `{` and `}`. Remember those curly brackets indicate this is a Python dictionary object. Here the dict maps the names X, Y, and Z to lists. The list delimiters are `[` and `]` and the values of the list are separated by commas. Each list in this assignment statement will become a column of the data frame. You can see what the data looks like by printing it just as you would any other object:

In [55]:
print(df)                     # Print the dataframe. 
print('\n', type(df))

   X  Y  Z
0  1  5  4
1  2  3  2
2  3  2  1
3  4  1  7

 <class 'pandas.core.frame.DataFrame'>


Dataframes are easy to read when printed. Let's break it down: 

This created a DataFrame object. The syntax prefixing `DataFrame` in the print out describes the hierarchy: `DataFrame` is part of the `frame` group which is part of the `core` group of the package `pandas`. All you need to know is that you have just 'called' the `DataFrame()` function from the pandas (`pd.`) package. The `DataFrame()` function creates dataframes from other objects. If we don't pass an argument, it creates an empty DataFrame. For example:

In [56]:
df_empty = pd.DataFrame()
print(df_empty)

Empty DataFrame
Columns: []
Index: []


Circling back to our first dataframe (df), let's reference the shape attribute of a DataFrame to see how big our data is. Run the following to see if your mental mapping matches how pandas has stored the data:

In [57]:
print('Data frame shape:', df.shape)   

print('Data frame size:', df.size)  

print('Data frame types:', df.dtypes)

Data frame shape: (4, 3)
Data frame size: 12
Data frame types: X    int64
Y    int64
Z    int64
dtype: object


A more visual inspection of the data frame can be achieved without the print() function. The jupyter notebook adds some nice formatting that makes it easy to identify variable values for each observation.

### The Data Frame Index

Notice the column of numbers on the left-hand side of the data frame. It does not have a header. This is the **index** that pandas uses to tell our observations (rows) apart. The index does not need to run from 0 to n. For example, if working with a time series it would be advantageous for our index to be a time variable. The index can be altered just like any of the other columns. To reference a column in the DataFrame simply use the name in the header. To print the last column on the right you would type

In [58]:
print(df['Z'])

0    4
1    2
2    1
3    7
Name: Z, dtype: int64


Printing a column returns both the index and the column, as well as the type of data contained in the column. 

In [59]:
z_var = df['Z']
print(z_var)
print(type(z_var))

0    4
1    2
2    1
3    7
Name: Z, dtype: int64
<class 'pandas.core.series.Series'>


When we extract a single column from a DataFrame, we are given a **Series**.

In [60]:
print(type(df['Z']))

<class 'pandas.core.series.Series'>


When working with big data it is likely you will only want to view small portions of the whole. The beginning of a data frame can be displayed in table format with the `head()` data frame method, and the end with the matching `tail()` method. The command below uses `tail()` to show the last row:

In [61]:
print(df.tail(1))

   X  Y  Z
3  4  1  7


## Practice Exercise

1. Create a data frame with seven observations of four variables. Name the variables john, paul, george, and ringo. Make the first column the first 7 odd numbers going from zero towards infinity. Make each subsequent column twice the value of the column to its left.

2. Print out the number of observations and variables in the data frame from (1)

In [62]:
df_ = (
    pd.DataFrame(data={ 
        'John': [1,3,5,7,9,11,13],
        'Paul': [2,6,10,14,18,22,26],
        'George': [4,12,20,28,36,44,52],
        'Ringo': [8,24,40,56,72,88,104]}))

In [63]:
df_

   John  Paul  George  Ringo
0     1     2       4      8
1     3     6      12     24
2     5    10      20     40
3     7    14      28     56
4     9    18      36     72
5    11    22      44     88
6    13    26      52    104


## Operations on Rows and Columns.

Many of the data files we will be working with are organized as a table (rows and columns). Each row is recorded on a separate line of the file and columns separated by a delimiting character. These files can be viewed using text editors such as Notepad. 

A csv file is a very common type of delimited file that uses commas as the delimiters. Other common separators in delimited files are tabs and pipes, "|". They are the same as csv files except for the use of different separators. These non-comma-delimited files will often have a file type of .txt or .dat.

Importing any type of data is trivial if you know how to give good directions. 

In [64]:
wine_path = '../data/WineData.csv'
wines = pd.read_csv(wine_path)

The following displays the data type of each column.

In [65]:
wines.dtypes

Class label                       int64
Alcohol                         float64
Malic acid                      float64
Ash                             float64
Alcalinity of ash               float64
Magnesium                         int64
Total phenols                   float64
Flavanoids                      float64
Nonflavanoid phenols            float64
Proanthocyanins                 float64
Color intensity                 float64
Hue                             float64
OD280/OD315 of diluted wines    float64
Proline                           int64
dtype: object

We can also calculate summary statistics for each variable in the dataframe:

In [66]:
# Calculate summary statistics
wines.describe()

Unnamed: 0,Class label,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,1.938202,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258
std,0.775035,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474
min,1.0,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0
25%,1.0,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5
50%,2.0,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5
75%,3.0,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0
max,3.0,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0


Pandas provides an indexer called `loc` that can retrieve rows from the data frame. Rows can also be selected using the `iloc` indexer. Both are used with square brackets rather than called like functions. They are similar, but there is a subtle difference between the two: whereas `loc` selects rows and columns by their labels, `iloc` selects rows and columns by their integer positions. The distinction matters little while the row labels are the default integers, but it becomes important once the index holds other labels, and when working with columns.

In [67]:
wines1 = wines.copy()

Now let's change the row index from a series of ascending numbers to a series of labels with no discernable order:

In [68]:
 #create a copy so the original data can still be accessed
 #reset index values

df1 = df.copy()
df1.index = ["R1", "R2", "R3", "R4"]


**Note**: `copy()` makes a deep copy by default (`deep=True`), so changes to the copy do not affect the original. With the parameter `deep=False`, only the reference to the data (and index) is copied. This is a shallow copy. In pandas versions without Copy-on-Write enabled, changes made to the data of the original are reflected in a shallow copy, and changes made in the copy are reflected in the original.

In [69]:
wines1

Unnamed: 0,Class label,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050
2,1,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.50,16.8,113,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,3,13.71,5.65,2.45,20.5,95,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740
174,3,13.40,3.91,2.48,23.0,102,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750
175,3,13.27,4.28,2.26,20.0,120,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835
176,3,13.17,2.59,2.37,20.0,120,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840


In [70]:
df

   X  Y  Z
0  1  5  4
1  2  3  2
2  3  2  1
3  4  1  7


At this point, the difference between `loc` and `iloc` becomes more important. See the examples below where we reference the same data using different methods:

In [72]:
#use the row position to access the data in the X,Y,Z columns
row = df1.iloc[3]
row

X    4
Y    1
Z    7
Name: R4, dtype: int64

In [None]:
#use the row label to access the data in the X,Y,Z columns


To select a particular column, call the name of the column inside the data frame. It is also common to use the `loc` indexer, which takes the row labels of the data frame's index and, optionally, column labels. It is `iloc`, not `loc`, that accepts only integer positions.

In [73]:
column = df1['X']
column

R1    1
R2    2
R3    3
R4    4
Name: X, dtype: int64

In [74]:
#make it prettier with the square bracket
column = df1[['X']]
column

    X
R1  1
R2  2
R3  3
R4  4


In [77]:
#Find a particular cell value: row label first, then column label
cell = df1.loc['R2', 'Z']

## Renaming Columns or Indices of a DataFrame

We previously re-indexed the data by assigning to the `.index` attribute. To give the columns different names, it’s best to use the `.rename()` method.

In [78]:
df1

    X  Y  Z
R1  1  5  4
R2  2  3  2
R3  3  2  1
R4  4  1  7


In [79]:
newnames = {'X': 'A', 'Y':'B', 'Z':'C'}
df1.rename(columns=newnames, inplace=True)
df1

    A  B  C
R1  1  5  4
R2  2  3  2
R3  3  2  1
R4  4  1  7


In [None]:
#rename just one column at a time
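
A possible way to fill in this cell, sketched on a small stand-in frame. The frame `demo` and the new name `C_new` are purely illustrative assumptions. Only the columns listed in the mapping are renamed.

```python
import pandas as pd

# Small stand-in frame; 'C_new' is a made-up name for illustration
demo = pd.DataFrame({'A': [1, 2], 'B': [5, 3], 'C': [4, 2]})

# Only the keys present in the mapping are renamed; other columns are untouched
demo = demo.rename(columns={'C': 'C_new'})
print(demo.columns.tolist())
```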


## Practice Exercises

Take 10 minutes to complete the following tasks. If necessary, refer to the tasks we worked through previously in the notebook - but try your best to work from memory. 

1. Import the "Car.csv" data set.

2. Print the type of each variable of the Car data set.

3. Return the number of observations in the Car data set.

4. Print all of the information associated with the Pontiac Firebird.

5. Rename the disp column "Engine Displacement"

6. Return the Engine Displacement of the Datsun 710 model.

In [80]:
#1
cars = pd.read_csv("../data/Cars.csv")

In [83]:
#2
cars.dtypes

model     object
mpg      float64
cyl        int64
disp     float64
hp         int64
drat     float64
wt       float64
qsec     float64
vs         int64
am         int64
gear       int64
carb       int64
dtype: object

In [87]:
#3
cars.count()

model    32
mpg      32
cyl      32
disp     32
hp       32
drat     32
wt       32
qsec     32
vs       32
am       32
gear     32
carb     32
dtype: int64

In [88]:
#4
cars.loc[cars['model'] == "Pontiac Firebird"]

               model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
24  Pontiac Firebird  19.2    8  400.0  175  3.08  3.845  17.05   0   0     3     2


In [90]:
#5
cars.rename(columns={'disp':'Engine Displacement'}, inplace=True)

In [92]:
#6
#select the matching row by condition and the column by label
cars.loc[cars['model'] == "Datsun 710", "Engine Displacement"]

2    108.0
Name: Engine Displacement, dtype: float64

## Dropping rows and columns from a dataframe

A DataFrame can work like a dictionary of like-indexed Series objects. Identifying columns works with the same syntax as dictionary operations.

Using the `.drop()` method allows users to drop/remove/delete rows from a DataFrame. The `axis` parameter specifies which axis the labels are removed from. The default is `axis=0`, meaning rows are removed. Using `axis=1` or the `columns` parameter removes columns instead. By default, pandas returns a new DataFrame with the rows removed and leaves the original untouched; using `inplace=True` removes the rows from the existing DataFrame.

In [94]:
#start from df1; note that df2 = df1 binds a second name to the same DataFrame, it does not copy it
df2 = df1
df2

    A  B  C
R1  1  5  4
R2  2  3  2
R3  3  2  1
R4  4  1  7


In [96]:
df2 = df1.drop(['R1'])
df2

    A  B  C
R2  2  3  2
R3  3  2  1
R4  4  1  7


In [97]:
df1  #the original data is preserved in this case, because drop() returns a new DataFrame

    A  B  C
R1  1  5  4
R2  2  3  2
R3  3  2  1
R4  4  1  7


In [None]:
#in this case, the original data is modified
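
One way this cell might be filled in, sketched on a stand-in frame named `demo` (an assumption, so that the real `df1` is left alone). With `inplace=True` the frame itself is changed and nothing is returned.

```python
import pandas as pd

# Stand-in copy of the example frame (assumption: same values as df1 after renaming)
demo = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'B': [5, 3, 2, 1], 'C': [4, 2, 1, 7]},
    index=['R1', 'R2', 'R3', 'R4'])

result = demo.drop(['R1'], inplace=True)   # modifies demo itself
print(result)                              # None: nothing is returned when inplace=True
print(demo)
```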

It is also possible to delete rows over a range. The example below removes all rows starting from the third row.
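
A minimal sketch of a range drop, using a stand-in frame `demo` (an assumption) with the same values as the example frame. Slicing the index gives the labels of the third row onward, which are then passed to `drop()`.

```python
import pandas as pd

# Stand-in frame (assumption: same values as the example frame)
demo = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'B': [5, 3, 2, 1], 'C': [4, 2, 1, 7]},
    index=['R1', 'R2', 'R3', 'R4'])

# demo.index[2:] holds the labels from the third row to the end
kept = demo.drop(demo.index[2:])
print(kept)
```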

Removing data often makes the most sense when you are doing so based upon the values of the data. Imagine you are analyzing data on individuals' earnings and education. You may want to eliminate all individuals who have less than a high-school education or who earn more than $200,000. To achieve this you would remove rows by checking a logical condition. With respect to our current dataframe, imagine if we wanted to drop all observations for which the value in column 'C' is greater than or equal to 5:

In [None]:
#find observations that meet the condition
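
One way to find the matching observations, sketched on a stand-in frame `demo` (an assumption). Comparing a column with a value produces a boolean Series, which `loc` accepts as a row filter.

```python
import pandas as pd

# Stand-in frame (assumption: same values as the example frame)
demo = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'B': [5, 3, 2, 1], 'C': [4, 2, 1, 7]},
    index=['R1', 'R2', 'R3', 'R4'])

mask = demo['C'] >= 5        # True where the condition holds
print(mask)

flagged = demo.loc[mask]     # the rows that meet the condition
print(flagged)
```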


In [99]:
#create a DataFrame free from the unwanted observations
#keep only the rows where C is less than 5, which drops the rows where C >= 5

df2 = df1.loc[df1['C'] < 5]
df2

    A  B  C
R1  1  5  4
R2  2  3  2
R3  3  2  1


Lastly, at least with respect to row operations, you may find missing values. For some observational units it may not be possible to collect all of the necessary data. When this happens you will have to do some data cleaning for None, Null, or np.NaN values using `.dropna()`.

Below we replace a value in our DataFrame with a missing value and a NaN value. Then we use `.dropna()` to remove all rows that have None, Null & NaN values in any column.

In [100]:
import numpy as np

Pandas will recognize a value as **null** if it is a `np.nan` object. These print as **NaN** in the DataFrame. If the missing values are empty strings, Pandas won't recognize them as null. To fix this, you can convert the empty strings (or whatever is in your empty cells) to `np.nan` objects using `.replace()` and then call `.dropna()` on the DataFrame to delete rows with null values.
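
The replace-then-drop sequence described above can be sketched as follows. The frame `demo` and its values are made up for illustration: one cell holds `np.nan` and another holds an empty string.

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumption): R2 has a NaN, R3 has an empty string
demo = pd.DataFrame(
    {'A': [1.0, np.nan, 3.0, 4.0], 'B': ['x', 'y', '', 'w']},
    index=['R1', 'R2', 'R3', 'R4'])

# An empty string is not treated as missing, so only R2 is dropped here
only_nan_dropped = demo.dropna()
print(only_nan_dropped)

# Convert empty strings to NaN first, then drop every row with a missing value
cleaned = demo.replace('', np.nan).dropna()
print(cleaned)
```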

Now that we have covered row operations, working with columns will be a breeze. You can use several methods to delete a column. The easiest is the `del` statement.

In [None]:
#delete column X
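
A sketch of how this cell might look, using a stand-in frame `demo` (an assumption) so the original data stays intact. `del` removes the column in place, while `drop(columns=...)` returns a new frame.

```python
import pandas as pd

# Stand-in frame (assumption: same values as the first example frame)
demo = pd.DataFrame({'X': [1, 2, 3, 4], 'Y': [5, 3, 2, 1], 'Z': [4, 2, 1, 7]})

del demo['X']                          # removes column X from demo itself
print(demo.columns.tolist())

without_y = demo.drop(columns=['Y'])   # returns a new frame; demo keeps Y
print(without_y.columns.tolist())
```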

Columns can be added by using the `insert()` method.

In [None]:
# inserting column with single value in data frame
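
One possible version of this cell, sketched on a stand-in frame. The frame `demo`, the column name `W`, and the value `0` are assumptions for illustration. `insert()` takes the position, the new column name, and the value, and it changes the frame in place.

```python
import pandas as pd

# Stand-in frame (assumption: same values as the first example frame)
demo = pd.DataFrame({'X': [1, 2, 3, 4], 'Y': [5, 3, 2, 1], 'Z': [4, 2, 1, 7]})

# Place a new column named 'W' at position 1; the single value fills every row
demo.insert(loc=1, column='W', value=0)
print(demo)
```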

## Practice Exercises

Take 10 minutes to complete the following tasks. If necessary, refer to the tasks we worked through previously in the notebook - but try your best to work from memory. 

1. Return to the "Car.csv" data set and import it as a DataFrame named "cars".

2. Create a column named "Gas_Guzzler" and enter a value of 1 if the vehicle has an average fuel economy of less than 15 mpg and a value of "0" if it has an average fuel economy of 15 or more miles per gallon

3. Drop all Toyotas from the DataFrame.

4. Index the DataFrame using the model name. 

5. Drop the first 5 observations from the DataFrame.

In [None]:
#1

In [None]:
#2

In [None]:
#3

In [None]:
#4

In [None]:
#5

## Regular Expressions

Finding and/or altering patterns in character strings can be achieved simply using regular expressions. Regular expressions, or regex, are a character description of the pattern you want to match. If you know the exact character string you want to find you would provide that as the search string. Often the pattern we are interested in has some variability in it. 

Finding groups of items in your data is greatly aided by using regular expressions. For instance, if we wanted to find all substrings that start with `m` and end with `w`, we would need a way to represent one or more unknown characters. Regular expressions provide `.` for any character and `+` to repeat what precedes it one or more times. Using these allows us to write `m.+w` to search for substrings that start with `m` and end with `w`.

Some regular expression operators help identify a set of characters to match.

* .	Matches any character
* \w	Matches any word character
* \d	Matches any digit
* \s	Matches any white space
* \S	Matches everything except any white space
* \[ \]	Used to create a list of possible characters to match
* [^ ]	Used to create a list of possible characters to not match
* |	Or operator, allows matching a set of strings

Other operators indicate how many matches of a particular character or matches to be made from a set of characters.

* \+	Match the prior character one or more times
* \*	Match the prior character zero or more times
* \?	Match the prior character at most one time

Other operators can help you define where in a string to start or end a match.


* \b	Used to identify empty string at either edge of a word
* ^	Used to identify the start of the character string
* \$	Used to identify the end of the character string

Finally, parentheses create capturing groups. A group captures the part of the string matched by the pattern inside the `()`, and because quantifiers such as `+` and `*` are 'greedy', that pattern matches as long a string as it can. More than one substring can be captured by using additional sets of `()`. A captured substring is referenced using `\n`, where `n` is an integer giving the position of the `()` to be used (`\1` for the first group, `\2` for the second, and so on). For example, `\s(al)` matches an `a` followed by `l` if the `a` is preceded by white space. In the string `we are all learning something magical` it would match the al of all but not the al of magical. On the other hand, `\s(\w+)$` matches the last word of a string if the string ends with a word.

Remember to escape the following characters:

. + * ? [ ] $ ^ ( ) { } | \

In [None]:
practice = "Da hills are alive with da sound of music"

The module for regular expressions is re and it must be imported to use regular expressions. This module is a part of the standard library, meaning you do not need to download or install it prior to use.
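
A few `re` calls applied to the practice string, as a minimal sketch. The patterns and the replacement word are illustrative choices, not the only way to do it.

```python
import re

practice = "Da hills are alive with da sound of music"

# Replace 'Da' or 'da' with 'the' when it stands alone as a word
fixed = re.sub(r'\b[Dd]a\b', 'the', practice)
print(fixed)

# Find every word that starts with 'a'
a_words = re.findall(r'\ba\w*', practice)
print(a_words)

# Capture the two words that follow 'with' using two groups
pair = re.search(r'with (\w+) (\w+)', practice).groups()
print(pair)
```

The raw-string prefix `r` keeps the backslashes in `\b` and `\w` from being interpreted by Python before the pattern reaches `re`.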

## Practice
Practice replacing letters and adjusting cases. Add and remove spaces. Replace all of the l's with upper case letters with a single command. Successfully execute five regular expression commands, then raise your hand.

Now let's continue to practice using the `TitanicSurvival` dataset:

In [None]:
import pandas as pd
titanic_path = 

titanic_in = pd.read_csv(titanic_path)
titanic_in



In [None]:
titanic_in = (
    titanic_in
        .rename(
            columns={
                'Unnamed: 0': 'name',
                'passengerClass': 'pass_class'}))
titanic = titanic_in.copy(deep=True)

titanic

Let's start with something simple: changing every instance of the name `Hudson` to `Rock`. 

Next up, we can change every `Mr.` to `Broham`. 
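Both replacements can be sketched with the pandas `str.replace()` method. The names below are hypothetical strings in a "Last, Title. First" style, not rows from the real data set; on the real data the same calls would be applied to the `name` column.

```python
import pandas as pd

# Hypothetical names (assumption), not taken from the TitanicSurvival file
names = pd.Series(['Hudson, Mr. John', 'Smith, Mrs. Hudson', 'Brown, Miss. Amy'])

# Literal replacement: regex=False treats the pattern as plain text
rocked = names.str.replace('Hudson', 'Rock', regex=False)

# Regex replacement: the period is escaped so it matches a literal '.',
# which also leaves 'Mrs.' untouched
broham = rocked.str.replace(r'Mr\.', 'Broham', regex=True)
print(broham.tolist())
```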

And boom goes the dynamite! Now let's try something a little more difficult: create a variable for each passenger's title and last name using the str.extract() method to subset each of the character strings in the name variable. 

Python strings can be specified in raw form by placing an r in front of the opening quote. This allows backslashes to be used as backslashes without needing to be escaped, so, r'\d' would be the same as '\\d'. 

A comma separates a person's last name from their title and first name in the name column. We can extract the last name with a capturing group, `()`. We expect the name to contain one or more characters, `+`, none of which is a comma, `[^,]`.

The title is one or more characters, `+`, that are not a period, `[^\.]`, that follow a comma and space, `, `, and are followed by a period, `\.`.


In [None]:
titanic = (
    titanic
        .assign(
            last=lambda df: 
                # raw strings avoid doubled backslashes; expand=False returns a Series
                df['name'].str.extract(r'([^,]+),', expand=False),
            title=lambda df: 
                df['name'].str.extract(r', ([^\.]+)\.', expand=False)))
      

(titanic
    .loc[:, ['name', 'title', 'last']]
    .head()
    .pipe(print))

## Lambda Functions
A lambda function in Python is an unnamed function that is defined using the lambda keyword. Lambda functions are also referred to as "anonymous functions". They are typically used for creating one-line functions or as arguments to functions that accept a function as an argument.

For example, you can create a lambda function that takes the log of a number like this:


In [None]:
import numpy as np
log = lambda x: np.log(x)

print(log(1))

Lambda functions are particularly useful when you need a throwaway function for a specific purpose. However, they are limited in their capabilities because they can only contain a single expression. If you need a more complex or multi-statement function it's better to use a regular user-defined function.
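
The trade-off can be sketched by writing the same one-line function both ways and then passing a lambda as an argument. The function names and the word list are made up for illustration.

```python
# A regular named function
def double(x):
    return 2 * x

# The equivalent lambda, limited to a single expression
double_lambda = lambda x: 2 * x

print(double(4), double_lambda(4))

# A throwaway lambda passed as the key argument: sort words by their length
words = ['pandas', 're', 'numpy']
by_length = sorted(words, key=lambda w: len(w))
print(by_length)
```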

## Practice


1. What is the most common first name of passengers?
2. Two common designations for women appear in the title variable, Mrs. and Miss. None of the other titles meet both of these conditions. Try to match these titles using a regular expression and create a new column (the name is your choice) that indicates whether the passenger was a Mrs. or Miss. Note that the `str.contains()` method returns a Boolean value for each row indicating whether the searched-for string was found. Ask for help as necessary and use the code blocks above to guide your way.

# Exploring the data

Exploratory data analysis (EDA) is the first step in analyzing data. This is the step BEFORE hypothesis testing. EDA is necessary to:
* Identify measurement errors and outliers
* Help with model selection
* Map out relationships between explanatory variables
* Check correlation between the DV and IVs

People are not very good at looking at a column of numbers or a whole spreadsheet and then determining important characteristics of the data.  They find looking at numbers to be tedious, boring, and/or overwhelming. Exploratory data analysis techniques have been devised as an aid in this situation. Most of these techniques work in part by hiding certain aspects of the data while making other aspects more clear.

To get started heading in the right direction, it is best to start by answering some questions about your data:

1.  Does time matter?
2.  Do we have a spatial aspect?
3.  Do we have repeated observations?

From there, the next steps for pre-processing the data should be self-evident. A good place to start is summarizing and tabulating. We already know that the pandas `describe()` method reports basic summary statistics, such as the count, mean, standard deviation, and percentiles, for a data frame or a series of numeric values.

In [None]:
import pandas as pd
import numpy as np

# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                  header=None, names=headers, na_values="?" )

df.to_csv(r'Auto_File.csv', index = False)
df.head()
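
As a reminder of what `describe()` returns, here is a small sketch on a made-up frame (the values are not rows from the auto data):

```python
import pandas as pd

# Made-up values, not rows from the auto data set.
sample = pd.DataFrame({"price": [10000, 12000, 14000, 16000],
                       "make": ["a", "a", "b", "b"]})

# By default describe() summarizes only the numeric columns.
summary = sample.describe()
print(summary)

# include="all" also reports count, unique, top, and freq for text columns.
print(sample.describe(include="all"))
```

The same call on `df` would summarize every numeric column of the auto data at once.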

Start by determining whether any variables have a lot of missing values. It is also useful to look for patterns in the missing values.

In [None]:
obj_df = df

In [None]:
obj_df[obj_df.isnull().any(axis=1)]

In [None]:
obj_df = obj_df.fillna({"num_doors": "four"})

In [None]:
obj_df[obj_df.isnull().any(axis=1)]
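
A per-column count is often easier to read than the full rows. The sketch below uses a made-up frame. Filling with the most frequent value is just one possible choice, shown here as an assumption:

```python
import numpy as np
import pandas as pd

# Made-up frame with a few deliberately missing entries.
sample = pd.DataFrame({"num_doors": ["two", None, "four", "four"],
                       "price": [10000.0, np.nan, 14000.0, np.nan],
                       "make": ["a", "b", "c", "d"]})

# Number of missing values in each column.
missing_counts = sample.isnull().sum()
print(missing_counts)

# One option: fill a categorical column with its most frequent value.
most_common = sample["num_doors"].mode()[0]
filled = sample.fillna({"num_doors": most_common})
print(filled)
```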

Categorical variables can be either ordinal or nominal. Ordinal categorical variables are those that have an inherent order. 

Those include classes measuring quality. It is common to assign ordinal categorical variables a numerical value based on rank. 

Nominal categorical variables have no order. It is common to create dummies for each class within a variable with binary values.
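
Both encodings can be sketched on a made-up frame. The column names and the low/medium/high ranking are assumptions for illustration:

```python
import pandas as pd

# Made-up data: "quality" is ordinal, "color" is nominal.
sample = pd.DataFrame({"quality": ["low", "high", "medium", "low"],
                       "color": ["red", "blue", "red", "green"]})

# Ordinal: map each class to its rank (this ordering is an assumption).
quality_rank = {"low": 1, "medium": 2, "high": 3}
sample["quality_rank"] = sample["quality"].map(quality_rank)

# Nominal: one binary indicator column per class.
dummies = pd.get_dummies(sample["color"], prefix="color")

print(sample)
print(dummies)
```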

A simple tabulation of the frequency of each category is the best univariate non-graphical EDA for categorical data. 

If a quantitative variable does not have too many distinct values, a tabulation, as we used for categorical data, is a worthwhile univariate, non-graphical technique. For most quantitative variables, however, we rely on numeric (non-graphical) measures, namely the various sample statistics. Sample statistics are generally thought of as estimates of the corresponding population parameters.
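
A minimal sketch of both techniques, using made-up values rather than the auto data:

```python
import pandas as pd

# Made-up values for illustration.
body_style = pd.Series(["sedan", "hatchback", "sedan", "wagon", "sedan"])
horsepower = pd.Series([111, 154, 102, 115, 110])

# Tabulation for a categorical variable.
counts = body_style.value_counts()
print(counts)

# Sample statistics for a quantitative variable.
hp_mean = horsepower.mean()
hp_median = horsepower.median()
hp_std = horsepower.std()
print(hp_mean, hp_median, hp_std)
```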

If categorical variables share a clear relationship to the DV it may make sense to encode those values.

In [None]:
obj_df.dtypes

The data has several variables that are recorded as `object` rather than numerical data. If we want to run any calculations on this data, we will need to encode these variables. Pandas' `select_dtypes` function can be used to build a new dataframe containing only the `object` columns.

In [None]:
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head(10)

In [None]:
obj_df.dtypes

One way to turn the object entries into numerical variables is simply to identify and replace them. For example, I happen to know there are only a few engine cylinder configurations, and all rely on somewhere between 1 and 12 cylinders, though 4 and 6 are by far the most common in passenger vehicles. I bet we could easily set up a dictionary that drops the cylinder categories into buckets of numerical cylinder counts. 

In [None]:
obj_df["num_doors"].value_counts()

In [None]:
obj_df["fuel_type"].value_counts()

In [None]:
cleanup_cylinders = {"num_cylinders": {"two": 2, "four": 4}}

In [None]:
type(cleanup_cylinders)

In [None]:
obj_df = obj_df.replace(cleanup_cylinders)
obj_df.head()

You may think this is tedious, and it is. With Stata you may have used label encoding and thought it superior. Label encoding converts each value in a column to a number. For example, the fuel_system column contains 8 different values. A quick way to encode a variable is to convert the column to a category and then use the category codes as your label encoding. It goes like this:

In [None]:
obj_df["fuel_system"].value_counts()

In [None]:
obj_df["fuel_system"] = obj_df["fuel_system"].astype('category')
obj_df.dtypes

Now that it is a category, it is possible to assign the encoded variable to a new column using the `cat.codes` accessor:

In [None]:
obj_df["fuel_system_cat"] = obj_df["fuel_system"].cat.codes
obj_df
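
The codes are positions in the column's sorted list of categories. The sketch below, on a made-up subset of labels, shows how to recover which code stands for which label:

```python
import pandas as pd

# Made-up subset of fuel system labels.
fuel = pd.Series(["mpfi", "2bbl", "mpfi", "idi"]).astype("category")

# Categories are stored in sorted order; each code is an index into this list.
categories = list(fuel.cat.categories)
print(categories)

codes = fuel.cat.codes.tolist()
print(codes)

# Lookup table from code back to label.
code_to_label = dict(enumerate(categories))
print(code_to_label)
```

Keep in mind that these codes follow alphabetical order, so they carry no meaningful ranking for a nominal variable.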

In [None]:
df.groupby('fuel_type')['curb_weight'].mean()

In [None]:
df.dtypes

### Practice Exercise

Take some time to practice on your own. There are several other variables saved as objects. How might you generate numerical data from them?

1. Create a numerical variable for the number of cylinders
2. Create a categorical variable for the make of the vehicle
3. Generate a variable that classifies each vehicle as small, medium, or large depending upon some function of the engine size and wheelbase