## **Reading CSV files with Python**

In [39]:
import csv

with open('../datasets/mpg.csv') as csvfile:
  mpg = list(csv.DictReader(csvfile))

mpg[:3] # The first three dictionaries in our list.

[{'': '1',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '1.8',
  'year': '1999',
  'cyl': '4',
  'trans': 'auto(l5)',
  'drv': 'f',
  'cty': '18',
  'hwy': '29',
  'fl': 'p',
  'class': 'compact'},
 {'': '2',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '1.8',
  'year': '1999',
  'cyl': '4',
  'trans': 'manual(m5)',
  'drv': 'f',
  'cty': '21',
  'hwy': '29',
  'fl': 'p',
  'class': 'compact'},
 {'': '3',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '2',
  'year': '2008',
  'cyl': '4',
  'trans': 'manual(m6)',
  'drv': 'f',
  'cty': '20',
  'hwy': '31',
  'fl': 'p',
  'class': 'compact'}]

**Summing data in a CSV file**

In [40]:
sum(float(d['cty']) for d in mpg)

3945.0
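Note that `csv.DictReader` returns every value as a string, which is why the sum above wraps each value in `float()`. The sketch below shows the same idea on a small in-memory sample; the three rows are a hypothetical subset modelled on the output above, not the real file.

```python
import csv
import io

# Hypothetical three-row sample modelled on the rows printed above
sample = io.StringIO(
    "manufacturer,model,cty,hwy\n"
    "audi,a4,18,29\n"
    "audi,a4,21,29\n"
    "audi,a4,20,31\n"
)
rows = list(csv.DictReader(sample))

# Every value is a string until we convert it
print(type(rows[0]["cty"]))

# Convert to float before doing arithmetic, then average
cty_values = [float(row["cty"]) for row in rows]
avg_cty = sum(cty_values) / len(cty_values)
print(round(avg_cty, 2))
```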

## **NumPy**

The NumPy module is mainly used for working with numerical
data. Its core object is the array (`ndarray`).
With arrays, we can apply a mathematical operation to
every value at once, and we can combine two arrays of
compatible shape element by element. Note that operators
such as `*` work element-wise; they are not matrix multiplication.
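As a minimal sketch of that idea, the example below contrasts element-wise multiplication with the dot product. The `prices` and `quantities` arrays are made-up values used only for illustration.

```python
import numpy as np

# Made-up values for illustration
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])

# One operation applied to every element at once
doubled = prices * 2

# Element-wise product of two arrays of the same shape
totals = prices * quantities

# The @ operator is the dot (matrix) product, a different operation
grand_total = prices @ quantities

print(doubled)
print(totals)
print(grand_total)
```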

In [41]:
import numpy as np
import math 

**Array creation**

In [42]:
a = np.array([1,2,3,4])
print(a.ndim)

1


Two-dimensional array

In [43]:
b = np.array([[1,2,3,4],[5,6,7,8]])
print(b.ndim)

2


In [44]:
print(b.shape)
# 2 by 4

(2, 4)


Creating arrays filled with default values

In [45]:
c = np.zeros((2,4))
print(c)
d = np.ones((3,2))
print(d)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]]
[[1. 1.]
 [1. 1.]
 [1. 1.]]
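Both `np.zeros` and `np.ones` produce floats by default, as the trailing dots in the output show. A short sketch of two related options, `np.full` for an arbitrary fill value and the `dtype` argument for a different element type:

```python
import numpy as np

# Fill an array with any constant, not just 0 or 1
sevens = np.full((2, 3), 7)

# Ask for integer zeros instead of the float default
int_zeros = np.zeros((2, 2), dtype=int)

print(sevens)
print(int_zeros.dtype)
```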


**Array operations**

In [46]:
e = np.arange(0, 10, 2)
f = np.linspace(0, 10, 5)

print(e)
print(f)
print(e+f)
print(e*f)
print(e**2)
print(e-f)
print(e/f)


[0 2 4 6 8]
[ 0.   2.5  5.   7.5 10. ]
[ 0.   4.5  9.  13.5 18. ]
[ 0.  5. 20. 45. 80.]
[ 0  4 16 36 64]
[ 0.  -0.5 -1.  -1.5 -2. ]
[nan 0.8 0.8 0.8 0.8]
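The `nan` in the last line comes from dividing 0 by 0 in the first position, which also makes NumPy emit a runtime warning. The sketch below shows two possible ways to handle that: silencing the warning with `np.errstate`, or skipping the zero divisors with the `where` argument of `np.divide`. Using 0 as the fallback value is an assumption made for this example.

```python
import numpy as np

e = np.arange(0, 10, 2)
f = np.linspace(0, 10, 5)

# Option 1: keep the nan but suppress the warning for this block only
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = e / f

# Option 2: only divide where f is non-zero; other slots keep the value from `out`
safe_ratio = np.divide(e, f, out=np.zeros_like(f), where=f != 0)

print(ratio)
print(safe_ratio)
```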




In [47]:
# Use reshape to change the shape of an array

g = np.array([1,2,3,4])
print(g.reshape(2,2))

[[1 2]
 [3 4]]
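Two details about `reshape` are easy to miss: it returns a new array object and leaves the original's shape alone, and one dimension can be given as `-1` so NumPy infers it. A small sketch (the array `h` is introduced here only for illustration):

```python
import numpy as np

h = np.arange(1, 7)          # [1 2 3 4 5 6]

# -1 asks NumPy to work out the missing dimension (6 / 2 = 3 columns)
two_rows = h.reshape(2, -1)

print(two_rows)
print(two_rows.shape)
print(h.shape)               # the original array keeps its shape
```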


In [48]:
g=g*2
print(g)

[2 4 6 8]


## **REGEX**

In [49]:
import re

# Use a name such as `text` so the built-in `str` type is not shadowed
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

print(re.search("Amy", text))

print(re.split("Amy", text))

print(re.findall("Amy", text))


<re.Match object; span=(0, 3), match='Amy'>
['', ' works diligently. ', ' gets good grades. Our student ', ' is successful.']
['Amy', 'Amy', 'Amy']
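`re.search` returns a match object (or `None` when nothing matches), so it is often used in a condition. The sketch below, on a shortened version of the sentence, shows how one might read the matched text and its position, and how `re.finditer` gives the position of every match.

```python
import re

text = "Amy works diligently. Amy gets good grades."

# search() stops at the first match; the result is None if there is none
first = re.search("Amy", text)
if first:
    print(first.group(), first.start(), first.end())

# finditer() yields one match object per occurrence
positions = [m.start() for m in re.finditer("Amy", text)]
print(positions)

# A pattern that is absent gives None
print(re.search("Bob", text))
```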


### **Patterns and character classes**

#### **Set Operators**

In [50]:
# Set Operator
grades="ACAAAABCBCBAA"
re.findall("[AB]", grades)

['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']

In [51]:
re.findall("[A][B-C]", grades)

['AC', 'AB']

In [52]:
# You could use the or operator | to get the same result
re.findall("AB|AC", grades)

['AC', 'AB']

**Caret - ^**

In [53]:
# Check if the 1st character is A
re.findall("^A", grades)

['A']

In [54]:
# Check if the first character is B
re.findall("^B", grades)

[]

In [55]:
# Since it is not B, it will return an empty list

**Dollar Sign - $**

In [56]:
# Check if the last character is A
re.findall("A$", grades)

['A']

In [57]:
# Check if the last character is B
re.findall("B$", grades)

[]

In [58]:
# Since the last character is not B, it will return an empty list

In [59]:
# Inside square brackets, ^ negates the set: return every character that is not A
re.findall("[^A]", grades)

['C', 'B', 'C', 'B', 'C', 'B']

In [60]:
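# Outside the brackets ^ anchors to the start, inside it negates the set.
# The first character is A, so "a first character that is not A" matches nothing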
re.findall("^[^A]", grades)

[]
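The caret therefore has two meanings depending on where it sits. A compact sketch of both, using the same `grades` string:

```python
import re

grades = "ACAAAABCBCBAA"

# ^ at the start of the pattern: anchor to the beginning of the string
starts_with_a = re.findall("^A", grades)

# ^ as the first character inside [...]: match anything except the listed characters
not_a = re.findall("[^A]", grades)

# Both together: a first character that is not B
first_not_b = re.findall("^[^B]", grades)

# Negated set with the end anchor: a last character that is not A
last_not_a = re.findall("[^A]$", grades)

print(starts_with_a, not_a, first_not_b, last_not_a)
```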

#### **Quantifiers**


In [61]:
# Quantifiers specify how many times a pattern must repeat for the match to succeed. The most basic
# quantifier is written e{m,n}, where e is the expression or character we are matching, m is the minimum
# number of repetitions, and n is the maximum number of repetitions.

# Let's use these grades as an example. How many times has this student been on a back-to-back A's streak?
re.findall("A{2,10}", grades) # 2 is the min and 10 is the max

['AAAA', 'AA']

In [62]:
# So we see that there were two streaks, one where the student had four A's, and one where they had only two
# A's

# We might try and do this using single values and just repeating the pattern
re.findall("A{1,1}A{1,1}", grades)

['AA', 'AA', 'AA']

In [63]:
# As you can see, this is different from the first example. The first pattern looks for any run of
# two to ten A's in a row, so it sees four A's as a single streak. The second pattern looks for exactly
# two A's back to back, so it sees two A's followed immediately by two more A's. The regex
# processor begins at the start of the string and consumes characters as it matches them, so matches never overlap.

# It's important to note that the regex quantifier syntax does not allow you to deviate from the {m,n}
# form. In particular, if you put an extra space between the braces, Python treats the braces as literal
# text rather than as a quantifier, so you'll get an empty result
re.findall("A{2, 2}",grades)

[]

In [64]:
# A single number in braces, {m}, means exactly m repetitions
re.findall("A{2}",grades)

['AA', 'AA', 'AA']

In [65]:
# Using this, we could find a decreasing trend in a student's grades
re.findall("A{1,10}B{1,10}C{1,10}",grades)

['AAAABC']
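The `{m,n}` form also has open-ended and shorthand variants: `{m,}` means "at least m", and `+` is equivalent to `{1,}`. A short sketch applying them to the same `grades` string, including one way to find the longest streak of A's:

```python
import re

grades = "ACAAAABCBCBAA"

# {2,} means two or more, with no upper bound
streaks = re.findall("A{2,}", grades)

# Length of the longest back-to-back run of A's
longest = max(len(s) for s in streaks)

# + means one or more, so this is the unbounded version of the trend pattern
trend = re.findall("A+B+C+", grades)

print(streaks, longest, trend)
```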

## **Pandas**

pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data. The name "pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

#### **The Series Data Structure in Pandas**

A pandas Series is a one-dimensional array. It holds any data type supported in Python and uses labels to locate each data value for retrieval. These labels form the index, and they can be strings or integers. A Series is the main data structure in the pandas framework for storing one-dimensional data.

The Series is one of the core data structures in pandas. You can think of it as a cross between a list and a dictionary. The items are stored in order, and there are labels with which you can retrieve them. An easy way to visualize this is as two columns of data: the first is the special index, a lot like the keys in a dictionary, while the second is your actual data. It's important to note that the data column has a label of its own, which can be retrieved using the .name attribute. This is different from dictionaries and is useful when it comes to merging multiple columns of data, which we'll talk about later in the course.
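To make the "two columns plus a name" picture concrete, here is a minimal sketch. The scores and the name `Score` are invented for illustration.

```python
import pandas as pd

# Invented data: the index holds the labels, the values hold the data
scores = pd.Series([85, 82, 90], index=['Alice', 'Jack', 'Molly'], name='Score')

print(scores.index.tolist())   # the labels, like dictionary keys
print(scores['Jack'])          # look a value up by its label
print(scores.name)             # the label of the data column itself

# When the Series becomes a DataFrame, its name becomes the column label
frame = scores.to_frame()
print(frame.columns.tolist())
```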

In [66]:
import pandas as pd

students = ['Alice', 'Jack', 'Molly']
pd.Series(students)
# dtype is object

0    Alice
1     Jack
2    Molly
dtype: object

In [67]:
numbers = [1, 2, 3]
pd.Series(numbers)
# dtype is int64

0    1
1    2
2    3
dtype: int64

In [68]:
# Adding None to a list of strings
students = ['Alice', 'Jack', None]
pd.Series(students)

0    Alice
1     Jack
2     None
dtype: object

In [69]:
# Above, pandas keeps None as it is and the dtype stays object.

# Adding None to a list of numbers
numbers = [1, 2, None]
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

In [70]:
# Above, the dtype is float64 and the missing value is NaN. NaN is not None, so an equality test between them is False.
import numpy as np
np.nan == None

False

In [71]:
# So we can't use an equality test to check whether a value is missing. Instead, we need a dedicated function such as np.isnan
np.isnan(np.nan)

True
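NaN is also not equal to itself, which is the deeper reason equality tests cannot detect it. For whole Series, pandas offers `pd.isna` and the `.isna()` method, which treat both `None` and NaN as missing. A brief sketch:

```python
import numpy as np
import pandas as pd

# NaN never compares equal, not even to itself
print(np.nan == np.nan)

numbers = pd.Series([1, 2, None])

# Element-wise missing-value mask
missing = numbers.isna()
print(missing.tolist())

# pd.isna works on scalars too, including None
print(pd.isna(None), pd.isna(np.nan))

# Aggregations such as sum() skip NaN by default
print(numbers.sum())
```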

In [72]:
# Here, we convert a dictionary (student name -> class) to a Series
students_scores = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English'}
s = pd.Series(students_scores)

In [73]:
# The index of the Series is built from the keys of the dictionary that we provided.
# We can override this by passing our own index when creating the Series.
s.index
# Here, we read the index through the .index attribute

Index(['Alice', 'Jack', 'Molly'], dtype='object')

In [75]:
s = pd.Series(students_scores, index=['Bob', 'Jack', 'Paul'])
s
# Here, we asked for Bob, Jack and Paul. Bob and Paul are not keys in the dictionary, so their values are NaN.
# Alice and Molly are in the dictionary but not in the index list, so they are left out.

Bob           NaN
Jack    Chemistry
Paul          NaN
dtype: object
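If NaN placeholders are not wanted, one option is to check which labels exist first, or to use `Series.reindex` with a `fill_value`. The sketch below assumes the placeholder text `'Unknown'`, which is an arbitrary choice for this example.

```python
import pandas as pd

student_classes = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English'}
s = pd.Series(student_classes)

wanted = ['Bob', 'Jack', 'Paul']

# Check which of the requested labels are actually present
present = [name in s.index for name in wanted]
print(present)

# Reindex to the requested labels, filling the gaps with a placeholder
filled = s.reindex(wanted, fill_value='Unknown')
print(filled)
```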

In [76]:
# You can also separate your index creation from the data by passing in the index as a 
# list explicitly to the series.

s = pd.Series(['Physics', 'Chemistry', 'English'], index=['Alice', 'Jack', 'Molly'])
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

#### **Querying Series**

In [84]:
student_classes = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English', 'Sam': 'History'}
s2 = pd.Series(student_classes)

print(s2.iloc[2])

print(s2.loc['Molly'])

# Keep in mind that iloc and loc are attributes, not methods. So you don't use
# parentheses to query them, but square brackets instead, which is called the indexing operator.
# In Python this calls __getitem__ or __setitem__ on the object, depending on the context of its use.

English
English
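The difference between the two attributes matters most in two situations: slicing, and Series whose labels are themselves integers. A short sketch of both (the `codes` Series is made up for illustration):

```python
import pandas as pd

s2 = pd.Series({'Alice': 'Physics', 'Jack': 'Chemistry',
                'Molly': 'English', 'Sam': 'History'})

# iloc slices by position and excludes the end, like a Python list
by_position = s2.iloc[0:2]

# loc slices by label and includes the end label
by_label = s2.loc['Alice':'Molly']

print(by_position.tolist())
print(by_label.tolist())

# With integer labels, being explicit avoids ambiguity
codes = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])
print(codes.loc[10])    # the label 10
print(codes.iloc[0])    # the first position
```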


### **The DataFrame Data Structure**

A pandas DataFrame is a two-dimensional data structure composed of rows and columns, much like a simple spreadsheet or a SQL table. Each column of a DataFrame is a pandas Series. All columns have the same length, but they can hold different data types: float, int, bool, and so on.
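A quick way to see the "each column is a Series" point is to build a DataFrame from a dictionary of columns and inspect it. The `Passed` column is a hypothetical addition used here only to show a third data type.

```python
import pandas as pd

# Each key is a column name, each list is that column's data
df = pd.DataFrame({
    'Name': ['Alice', 'Jack', 'Helen'],
    'Score': [85, 82, 90],
    'Passed': [True, True, True],   # hypothetical column for illustration
})

print(df.shape)              # (rows, columns)
print(df.dtypes)             # one dtype per column
print(type(df['Score']))     # a single column is a Series
```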

In [89]:
record_1 = pd.Series({'Name': 'Alice', 'Class': 'Physics', 'Score': 85})
record_2 = pd.Series({'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82})
record_3 = pd.Series({'Name': 'Helen', 'Class': 'Biology', 'Score': 90})

df = pd.DataFrame([record_1, record_2, record_3])
print(df)
print('Without giving it an index, pandas automatically uses integers starting from 0 up to n-1, where n is the number of rows in the data frame.')
print('---------------------------------------------------------------------------------------------------------------------------------')
df = pd.DataFrame([record_1, record_2, record_3], index=['school1', 'school2', 'school3'])
print(df)
print('This is how we can insert our index.')

    Name      Class  Score
0  Alice    Physics     85
1   Jack  Chemistry     82
2  Helen    Biology     90
Without giving it an index, pandas automatically uses integers starting from 0 up to n-1, where n is the number of rows in the data frame.
---------------------------------------------------------------------------------------------------------------------------------
          Name      Class  Score
school1  Alice    Physics     85
school2   Jack  Chemistry     82
school3  Helen    Biology     90
This is how we can insert our index.


In [90]:
# You can even use a list of dictionaries to insert into the data frame. 

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85}, {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82}, {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])
df

          Name      Class  Score
school1  Alice    Physics     85
school2   Jack  Chemistry     82
school1  Helen    Biology     90


In [96]:
print(df.loc['school2'])
print(type(df.loc['school2']))
print('For a row, the returned labels are from the columns.')
print('---------------------------------------------------------------------------------------------------------------------------------')
print(df.loc['school1'])
print(type(df.loc['school1']))
print('It\'s important to remember that index labels and column names, along either axis, can be non-unique. In this example, we see two records for school1 as different rows. If we pass a single label to the DataFrame .loc attribute and it matches multiple rows, the result is a new DataFrame rather than a Series.')

Name          Jack
Class    Chemistry
Score           82
Name: school2, dtype: object
<class 'pandas.core.series.Series'>
For a row, the returned labels are from the columns.
---------------------------------------------------------------------------------------------------------------------------------
          Name    Class  Score
school1  Alice  Physics     85
school1  Helen  Biology     90
<class 'pandas.core.frame.DataFrame'>
It's important to remember that index labels and column names, along either axis, can be non-unique. In this example, we see two records for school1 as different rows. If we pass a single label to the DataFrame .loc attribute and it matches multiple rows, the result is a new DataFrame rather than a Series.


In [97]:
# One of the strengths of the pandas DataFrame is that you can quickly select data along both axes.
# To do so, you supply two parameters to .loc: the row index label first, then the column name.

# For instance, if we are only interested in school1's student names
df.loc['school1', 'Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

In [None]:
# Remember, just like the Series, the pandas developers have implemented this using the indexing
# operator and not as parameters to a function.

# What would we do if we just wanted to select a single column though? Well, there are a few
# mechanisms. Firstly, we could transpose the matrix. This pivots all of the rows into columns
# and all of the columns into rows, and is done with the T attribute
df.T
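As a sketch of how the transpose route compares with the more direct options, the example below rebuilds the same small DataFrame and selects the `Name` column three ways. Plain column indexing with `df['Name']` and `df.loc[:, 'Name']` are standard pandas; presenting them side by side here is an addition to the original walkthrough.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# 1. Transpose, then pick what is now the 'Name' row
via_transpose = df.T.loc['Name']

# 2. Index the DataFrame directly with the column name
via_brackets = df['Name']

# 3. Use .loc with ':' meaning "all rows"
via_loc = df.loc[:, 'Name']

print(via_transpose.tolist())
print(via_brackets.tolist())
print(via_loc.tolist())
```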

<h3>Please refer to the other half of this lesson for the full explanation <a href='/Week-2/3-DataFrameDataStructure_ed.ipynb'>here</a></h3>

### **DataFrame Indexing and Loading**


In [99]:
!cat "../datasets/Admission_Predict.csv"

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR ,CGPA,Research,Chance of Admit 
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4,4.5,8.87,1,0.76
3,316,104,3,3,3.5,8,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2,3,8.21,0,0.65
6,330,115,5,4.5,3,9.34,1,0.9
7,321,109,3,3,4,8.2,1,0.75
8,308,101,2,3,4,7.9,0,0.68
9,302,102,1,2,1.5,8,0,0.5
10,323,108,3,3.5,3,8.6,0,0.45
11,325,106,3,3.5,4,8.4,1,0.52
12,327,111,4,4,4.5,9,1,0.84
13,328,112,4,4,4.5,9.1,1,0.78
14,307,109,3,4,3,8,1,0.62
15,311,104,3,3.5,2,8.2,1,0.61
16,314,105,3,3.5,2.5,8.3,0,0.54
17,317,107,3,4,3,8.7,0,0.66
18,319,106,3,4,3,8,1,0.65
19,318,110,3,4,3,8.8,0,0.63
20,303,102,3,3.5,3,8.5,0,0.62
21,312,107,3,3,2,7.9,1,0.64
22,325,114,4,3,2,8.4,0,0.7
23,328,116,5,5,5,9.5,1,0.94
24,334,119,5,5,4.5,9.7,1,0.95
25,336,119,5,4,3.5,9.8,1,0.97
26,340,120,5,4.5,4.5,9.6,1,0.94
27,322,109,5,4.5,3.5,8.8,0,0.76
28,298,98,2,1.5,2.5,7.5,1,0.44
29,295,93,1,2,2,7.2,0,0.46
30,310,99,2,1.5,2,7.3,0,0.54
31,300,97,2,3,3,8.1,1,0.65
32,327,103,3,

In [103]:
# pandas makes it easy to turn a CSV file into a data frame
df = pd.read_csv('../datasets/Admission_Predict.csv')
df.head()
print('--------------------------------------------------------------------------------------------------')


--------------------------------------------------------------------------------------------------


Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.00,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.80
5,314,103,2,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...
396,324,110,3,3.5,3.5,9.04,1,0.82
397,325,107,3,3.0,3.5,9.11,1,0.84
398,330,116,4,5.0,4.5,9.45,1,0.91
399,312,103,3,3.5,4.0,8.78,0,0.67


In [104]:
# You can make the serial number an index
df = pd.read_csv('../datasets/Admission_Predict.csv', index_col=0)
df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65
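Since the data file itself is not available here, the following sketch reads a tiny in-memory sample instead. The two rows and the reduced set of columns are a hypothetical excerpt modelled on the file shown above. It shows that `index_col` accepts either a position or a column name.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt; note the trailing spaces in two header names
csv_text = (
    "Serial No.,GRE Score,TOEFL Score,LOR ,Chance of Admit \n"
    "1,337,118,4.5,0.92\n"
    "2,324,107,4.5,0.76\n"
)

# index_col by position ...
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

# ... or by column name
by_name = pd.read_csv(io.StringIO(csv_text), index_col='Serial No.')

print(by_name.index.name)
print(by_name.loc[2, 'GRE Score'])
print(by_name.columns.tolist())   # header whitespace is preserved
```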


In [114]:
df.columns
# We can see the trailing white space after 'LOR' and 'Chance of Admit'

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA',
       'Research', 'Chance of Admit '],
      dtype='object')

In [106]:
# Renaming columns
new_df = df.rename(columns={'LOR ': 'Letter of Recommendation', 'Chance of Admit ': 'Chance of Admit'})
# Since the original names end with white space, you must include it in the keys when renaming
new_df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,Letter of Recommendation,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [107]:
# You can use the .columns attribute to view the columns
new_df.columns

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP',
       'Letter of Recommendation', 'CGPA', 'Research', 'Chance of Admit'],
      dtype='object')

In [112]:
# To remove leading and trailing white space from every column name, you can use .str.strip() on the columns
new_df.columns = new_df.columns.str.strip()
new_df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,Letter of Recommendation,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [113]:
new_df.columns

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP',
       'Letter of Recommendation', 'CGPA', 'Research', 'Chance of Admit'],
      dtype='object')
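Another way to clean column names, sketched here under the assumption that a new DataFrame is preferred over assigning to `.columns`, is to pass a function to `rename`. The function is applied to every column name, and the original DataFrame is left unchanged. The one-row frame below is made up for illustration.

```python
import pandas as pd

# Made-up one-row frame with the same trailing-space problem
df = pd.DataFrame({'LOR ': [4.5], 'Chance of Admit ': [0.92], 'SOP': [4.5]})

# rename() accepts a function; str.strip is applied to each column name
cleaned = df.rename(columns=str.strip)

# Any callable works, e.g. lower-casing and replacing spaces with underscores
snake = cleaned.rename(columns=lambda name: name.lower().replace(' ', '_'))

print(df.columns.tolist())        # original is untouched
print(cleaned.columns.tolist())
print(snake.columns.tolist())
```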