## CIS 9
## Pandas, Data Analysis, Data Cleaning

Reading
<br>Python Data Science Handbook Chapter 3
- Introducing Pandas Objects
- Data Indexing and Selection
- Handling Missing Data, section on NaN
- Combining Datasets: Concat and Append, section on concat
- Aggregation and Grouping, section on groupby
- Vectorized String Operations, up to but not including the Example Recipe Database

Comparison of different data storage:
- A _Python list_ can store different types of data and can change size, but the flexibility makes indexing and calculation of data in a list slow.
- A _numpy array_ stores only one data type and has a fixed size, so indexing and calculation of data in a numpy array are very fast.
- A _pandas data structure_ can store different types of data, so indexing data is a little slower than numpy but still faster than a list, and when calculations are done with numeric data, they are done using numpy and are very fast. 
<br>For data analysis purposes, this is the best of both worlds! We get some of the flexibility and all the calculation speed.
<br><br>A pandas DataFrame (a 2D structure) is the workhorse of data analysis.
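
A small sketch of the comparison above, using made-up scores. The variable names are illustrative only and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

scores_list = [90, "absent", 73.5]        # a list can mix types freely
scores_arr = np.array([90, 73, 79])       # one dtype for the whole array
scores_ser = pd.Series([90, 73, 79])      # a numeric Series is backed by a numpy array

# Arithmetic on a list needs an explicit loop or comprehension ...
curved_list = [s + 5 for s in [90, 73, 79]]
# ... while numpy and pandas apply the operation to every element at once.
curved_arr = scores_arr + 5
curved_ser = scores_ser + 5

print(curved_list)
print(curved_arr)
print(curved_ser.to_numpy())
```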

Import libraries

In [34]:
import pandas as pd
import numpy as np   
# pandas works without us importing numpy; this import is for when we need numpy directly.

Pandas __Series__: 1D sequence of data

In [35]:
# 1. A Series is similar to a Python list, with data and indices

nums = pd.Series([1,5,2,8,3])
print(nums)
nums

0    1
1    5
2    2
3    8
4    3
dtype: int64


0    1
1    5
2    2
3    8
4    3
dtype: int64

In [36]:
# accessing data in a Series

print("nums.values:", nums.values, '\n')
print("nums.index:", nums.index, '\n')
print("using index value 0:", nums[0], '\n')
print("using a slice:")
nums[:3]

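# Note: with the default integer index, nums[-1] is treated as the label -1 and raises KeyError.
# Use nums.iloc[-1] to index by position from the end.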

nums.values: [1 5 2 8 3] 

nums.index: RangeIndex(start=0, stop=5, step=1) 

using index value 0: 1 

using a slice:


0    1
1    5
2    2
dtype: int64
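
A short sketch of positional access with `.iloc`, which is not used in the cell above. It assumes the same `nums` Series with the default integer index.

```python
import pandas as pd

nums = pd.Series([1, 5, 2, 8, 3])

# .iloc selects by position, so negative positions count from the end
last = nums.iloc[-1]
last_two = nums.iloc[-2:]
print(last)
print(last_two.tolist())

# With the default integer index, nums[-1] looks for the *label* -1 instead
try:
    nums[-1]
except KeyError:
    print("nums[-1] raised KeyError")
```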

In [37]:
# 2. Internally, numeric data are stored in a numpy array

nums = pd.Series([0, -2.5, 8, -.7, 3])
print(type(nums[0]))

# and numpy operations can be used with Series that have numeric data
np.sum(nums)     

<class 'numpy.float64'>


7.8

In [38]:
# 3. A Series is more flexible than a Python list because we can customize the indices.
# In this way, a Series behaves similarly to a Python dictionary

nums = pd.Series([99, 85, 72, 89], index=['quiz1', 'quiz2', 'quiz3', 'quiz4'])
print(nums, '\n')
print("Quiz 1:", nums['quiz1'])

# Pandas provides a shorthand for accessing an element
# if the index label is a text string that is a valid Python name:
print("Quiz 1:", nums.quiz1)

quiz1    99
quiz2    85
quiz3    72
quiz4    89
dtype: int64 

Quiz 1: 99
Quiz 1: 99


In [39]:
# 4. In addition to creating a Series from a Python list, we can create a Series
# from a Python dictionary

d = {c:ord(c) for c in "ABCDE"}
print("dictionary:", d, "\n")

letters = pd.Series(d)
print(letters, '\n')

dictionary: {'A': 65, 'B': 66, 'C': 67, 'D': 68, 'E': 69} 

A    65
B    66
C    67
D    68
E    69
dtype: int64 



In [40]:
# Use indices that are strings in the same way as with numeric indices

print("selecting index A:", letters.A, '\n')

# what's a second syntax to select index A?
## letters["A"]
print(letters["A"], "\n")

print("selecting a slice:")
print(letters['C':'E'], '\n')

print("Selecting with a list:")
print(letters[['A','D']])

selecting index A: 65 

65 

selecting a slice:
C    67
D    68
E    69
dtype: int64 

Selecting with a list:
A    65
D    68
dtype: int64


---

Pandas __DataFrame__: 2D table

In [41]:
# 5. A DataFrame is a 2D table with rows and columns, similar to a Python list of lists or
# a numpy 2D array
df = pd.DataFrame([ [90, 92], [73, 82],[79, 80], [97, 95] ])
print(df, "\n")
df                    # Note the Python print() vs the Jupyter notebook print

    0   1
0  90  92
1  73  82
2  79  80
3  97  95 



,0,1
0,90,92
1,73,82
2,79,80
3,97,95


The 0 and 1 at the top of each column are the _column labels_<br>
The 0, 1, 2, 3 on the left of each row are the _row indices_<br>
The column labels and row indices are how we select a particular row and column

In [42]:
print("column labels:", df.columns)
print(df.columns.values)
print()
print("row indices:", df.index)
print(df.index.values)

column labels: RangeIndex(start=0, stop=2, step=1)
[0 1]

row indices: RangeIndex(start=0, stop=4, step=1)
[0 1 2 3]


In [43]:
# Internally, a DataFrame is made up of multiple Series,
# each column is a Series

print(type(df[0]), type(df[1]))

<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>


In [44]:
# A DataFrame can be created from a Series

newDF = letters.to_frame()    # letters Series from #4 above
print(type(newDF))
newDF

<class 'pandas.core.frame.DataFrame'>


,0
A,65
B,66
C,67
D,68
E,69


In [45]:
# 6. Just like with Series, we can customize the column labels.

df = pd.DataFrame(columns=["quiz1", "quiz2"],
                  data=[ [90, 92], [73, 82],[79, 80], [97, 95] ])
print(df, '\n')   

print("median of quiz 2:", np.median(df.quiz2))

# Why do numpy operations work on a column of a DataFrame?
## Each column of a DataFrame is a Series, and a numeric Series stores its data in a numpy array.

   quiz1  quiz2
0     90     92
1     73     82
2     79     80
3     97     95 

median of quiz 2: 87.0


In [46]:
# 7. An advantage of a DataFrame is that each column can have its own type of data

df = pd.DataFrame(columns=["Names", "quiz1", "quiz2"],
                  data=[ ["Fred",90,92.5], ["Wilma",73,82],["Barney",79,80], ["Betty",90,95] ])
df
# The Names column contains strings, the quiz1 column contains ints, 
# the quiz2 column contains floats

# Why does quiz2 contain floats while quiz1 contains ints?
## The quiz2 column contains a float (92.5). All values in a column share one data type, so pandas stores the whole column as floats.

,Names,quiz1,quiz2
0,Fred,90,92.5
1,Wilma,73,82.0
2,Barney,79,80.0
3,Betty,90,95.0


In [47]:
# In the above example, a DataFrame is created from a list of lists, where each inner list is a row
# A DataFrame can also be created from a dictionary, where each dictionary value (a list) is a column

df = pd.DataFrame({"Names":"Fred Wilma Barney Betty".split(),
                   "quiz1":[90, 73, 79, 90],
                   "quiz2":[92.5, 82, 80, 95] })
df

,Names,quiz1,quiz2
0,Fred,90,92.5
1,Wilma,73,82.0
2,Barney,79,80.0
3,Betty,90,95.0


__Accessing__ data

In [48]:
# 8. Selecting columns

# We've seen the . (dot) notation to index a column:
print(df.quiz1, '\n')
# or [] notation to index a column:
print(df["quiz1"], '\n')

# To select a slice of column labels that are strings,
# Use the .columns attribute 

print("selecting columns at index 1 and 2:", df.columns[1:3])
# now use the selected column to get the data
print(df[df.columns[1:3]], '\n')

0    90
1    73
2    79
3    90
Name: quiz1, dtype: int64 

0    90
1    73
2    79
3    90
Name: quiz1, dtype: int64 

selecting columns at index 1 and 2: Index(['quiz1', 'quiz2'], dtype='object')
   quiz1  quiz2
0     90   92.5
1     73   82.0
2     79   80.0
3     90   95.0 



In [49]:
df

,Names,quiz1,quiz2
0,Fred,90,92.5
1,Wilma,73,82.0
2,Barney,79,80.0
3,Betty,90,95.0


In [50]:
# Selecting rows
# There is no df.rows, this is because a DataFrame is made of multiple Series
# that are columns

# To select rows, use the .loc attribute:
print("first row:")
print(df.loc[0], '\n')

print("rows with index 1,2,3:")
print(df.loc[1:3], '\n')     # Note the *inclusive ending* for .loc

print("all rows, with selected columns:")
print(df.loc[:,['quiz1','quiz2']],'\n')

print("one row and column, or 1 element:")
print(df.loc[2,['quiz1']],'\n')   # a list of columns returns a Series; df.loc[2,'quiz1'] returns the single value

# When accessing a single element, it's faster to use .at:
print("better way to access one element:", df.at[2,'quiz1'],'\n')

first row:
Names    Fred
quiz1      90
quiz2    92.5
Name: 0, dtype: object 

rows with index 1,2,3:
    Names  quiz1  quiz2
1   Wilma     73   82.0
2  Barney     79   80.0
3   Betty     90   95.0 

all rows, with selected columns:
   quiz1  quiz2
0     90   92.5
1     73   82.0
2     79   80.0
3     90   95.0 

one row and column, or 1 element:
quiz1    79
Name: 2, dtype: object 

better way to access one element: 79 
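
The cell above selects by label with `.loc`. As a hedged side-by-side sketch, the block below rebuilds the same `df` and compares `.loc` with `.iloc`, the position-based counterpart, which this notebook does not otherwise use.

```python
import pandas as pd

df = pd.DataFrame({"Names": ["Fred", "Wilma", "Barney", "Betty"],
                   "quiz1": [90, 73, 79, 90],
                   "quiz2": [92.5, 82, 80, 95]})

by_label = df.loc[1:3, "quiz1"]     # row labels 1, 2, 3 (end label included)
by_position = df.iloc[1:3, 1]       # row positions 1, 2 (end position excluded)
last_row = df.iloc[-1]              # positions can be negative
one_value = df.at[2, "quiz1"]       # single element by label

print(by_label.tolist())
print(by_position.tolist())
print(last_row["Names"], one_value)
```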



In [51]:
df

,Names,quiz1,quiz2
0,Fred,90,92.5
1,Wilma,73,82.0
2,Barney,79,80.0
3,Betty,90,95.0


In [52]:
# 9. Boolean indexing:

print(df[df.quiz1 == 90], '\n')

print(df[df.quiz2 < 90],'\n')

print(df[df.Names == "Betty"],'\n')

# Write 1 print statement to print the names of students        ** EC **
# with quiz2 score greater than 90
print(df[df.quiz2 > 90].Names)

   Names  quiz1  quiz2
0   Fred     90   92.5
3  Betty     90   95.0 

    Names  quiz1  quiz2
1   Wilma     73   82.0
2  Barney     79   80.0 

   Names  quiz1  quiz2
3  Betty     90   95.0 

0     Fred
3    Betty
Name: Names, dtype: object


In [53]:
# Accessing data with 2 conditions

print(df[(df.quiz1>=90) & (df.quiz2>=90)])
print()
print(df[(df.quiz1>=90) | (df.quiz2>80)])

   Names  quiz1  quiz2
0   Fred     90   92.5
3  Betty     90   95.0

   Names  quiz1  quiz2
0   Fred     90   92.5
1  Wilma     73   82.0
3  Betty     90   95.0
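
A sketch of what happens inside the brackets in the two cells above: each comparison produces a Series of True/False values (a mask), and `&`, `|`, `~` combine masks element by element. The parentheses are needed because `&` and `|` bind more tightly than the comparisons. `isin()` is an extra method not used above.

```python
import pandas as pd

df = pd.DataFrame({"Names": ["Fred", "Wilma", "Barney", "Betty"],
                   "quiz1": [90, 73, 79, 90],
                   "quiz2": [92.5, 82, 80, 95]})

mask = (df.quiz1 >= 90) & (df.quiz2 >= 90)      # one True/False per row
print(mask.tolist())

not_90 = df[~(df.quiz1 == 90)]                  # ~ inverts a mask
in_set = df[df.Names.isin(["Fred", "Wilma"])]   # membership test per row
print(not_90.Names.tolist())
print(in_set.Names.tolist())
```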


In [54]:
df

,Names,quiz1,quiz2
0,Fred,90,92.5
1,Wilma,73,82.0
2,Barney,79,80.0
3,Betty,90,95.0


In [55]:
# 10. Getting unique values (no duplicates) in a column
print("unique data in quiz1:", df.quiz1.unique())
print()

# count the number of occurrences of data in a column
print("count of occurrences in quiz1:")
print(df.quiz1.value_counts())

unique data in quiz1: [90 73 79]

count of occurrences in quiz1:
90    2
73    1
79    1
Name: quiz1, dtype: int64


---

__Reading__ from files

In [56]:
# 11. read_csv always returns a DataFrame, even when the file has a single column of data
# (calling .squeeze("columns") on the result gives a Series)

quiz1 = pd.read_csv("quiz_scores.csv")
print(quiz1, '\n')

   quiz1
0     43
1     33
2     48
3     40
4     46
5     48
6     38
7     41 
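
A minimal sketch of reading a one-column file. To stay self-contained it reads from an in-memory string instead of `quiz_scores.csv`; the three values are taken from the start of the output above.

```python
import io
import pandas as pd

csv_text = "quiz1\n43\n33\n48\n"     # stand-in for a one-column CSV file

frame = pd.read_csv(io.StringIO(csv_text))
print(type(frame))                   # a DataFrame with one column

series = frame.squeeze("columns")    # collapse the single column into a Series
print(type(series))
print(series.tolist())
```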



In [57]:
# If the file is a csv file with rows and columns, it will be read into a DataFrame

gradebook = pd.read_csv("scores.csv")
print("row index:", gradebook.index)
gradebook

row index: RangeIndex(start=0, stop=8, step=1)


,Student,quiz1,midterm,quiz2,final
0,Sleepy,43,34.0,34,35
1,Happy,33,18.0,23,50
2,Doc,48,42.0,36,37
3,Grumpy,40,23.5,40,45
4,Bashful,46,42.5,46,41
5,Sneezy,48,39.5,48,43
6,Dopey,38,45.0,39,32
7,Snow White,41,44.0,39,41


In [58]:
gradebook = pd.read_csv("scores.csv", index_col='Student')
print("row index:", gradebook.index)
gradebook

row index: Index(['Sleepy', 'Happy', 'Doc', 'Grumpy', 'Bashful', 'Sneezy', 'Dopey',
       'Snow White'],
      dtype='object', name='Student')


Student,quiz1,midterm,quiz2,final
Sleepy,43,34.0,34,35
Happy,33,18.0,23,50
Doc,48,42.0,36,37
Grumpy,40,23.5,40,45
Bashful,46,42.5,46,41
Sneezy,48,39.5,48,43
Dopey,38,45.0,39,32
Snow White,41,44.0,39,41


In [59]:
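# what's different about how the file is read in here?
## header=0 marks the first line of the file as the header row, and names= replaces
## the labels found there with the ones we pass in.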

gradebook = pd.read_csv("scores.csv", header=0, names=["name","q1","midt","q2","fin"])
gradebook

,name,q1,midt,q2,fin
0,Sleepy,43,34.0,34,35
1,Happy,33,18.0,23,50
2,Doc,48,42.0,36,37
3,Grumpy,40,23.5,40,45
4,Bashful,46,42.5,46,41
5,Sneezy,48,39.5,48,43
6,Dopey,38,45.0,39,32
7,Snow White,41,44.0,39,41


In [60]:
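# what about this way of reading in the file?
## Only 4 names are given for 5 columns, so pandas uses the first column (the student names)
## as the row index and applies the 4 names to the remaining columns.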

gradebook = pd.read_csv("scores.csv", header=0, names=["q1","midt","q2","fin"])
gradebook

,q1,midt,q2,fin
Sleepy,43,34.0,34,35
Happy,33,18.0,23,50
Doc,48,42.0,36,37
Grumpy,40,23.5,40,45
Bashful,46,42.5,46,41
Sneezy,48,39.5,48,43
Dopey,38,45.0,39,32
Snow White,41,44.0,39,41
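
The two `names=` cells above can be reproduced on a tiny in-memory file. This is a sketch under the assumption that the file has one more column than the names passed in; the two data rows are copied from `scores.csv` as shown above.

```python
import io
import pandas as pd

csv_text = "Student,quiz1,midterm\nSleepy,43,34.0\nHappy,33,18.0\n"

# As many names as columns: the header row is replaced by our labels
renamed = pd.read_csv(io.StringIO(csv_text), header=0,
                      names=["name", "q1", "midt"])
print(renamed.columns.tolist())

# One name fewer than columns: the first column becomes the row index
shifted = pd.read_csv(io.StringIO(csv_text), header=0,
                      names=["q1", "midt"])
print(shifted.index.tolist())
print(shifted.columns.tolist())
```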


In [61]:
# We can also read from Excel files (among many other common types: HTML, JSON, SQL, etc.)

gradebook = pd.read_excel("scores.xlsx")
gradebook

,Student,quiz1,midterm,quiz2,final
0,Sleepy,43,34.0,34,35
1,Happy,33,20.0,23,49
2,Doc,48,32.0,36,37
3,Grumpy,40,23.5,40,45
4,Bashful,46,42.5,46,31
5,Sneezy,48,38.5,48,43
6,Dopey,38,45.0,39,32
7,Snow White,41,48.0,39,41


In [62]:
gradebook = pd.read_excel("scores.xlsx", index_col='Student')
gradebook

Student,quiz1,midterm,quiz2,final
Sleepy,43,34.0,34,35
Happy,33,20.0,23,49
Doc,48,32.0,36,37
Grumpy,40,23.5,40,45
Bashful,46,42.5,46,31
Sneezy,48,38.5,48,43
Dopey,38,45.0,39,32
Snow White,41,48.0,39,41


In [63]:
# 12. It's possible to set and reset the row index 
gradebook = pd.read_excel("scores.xlsx")
print(gradebook, "\n")

nameIndex = gradebook.set_index("Student")
nameIndex

      Student  quiz1  midterm  quiz2  final
0      Sleepy     43     34.0     34     35
1       Happy     33     20.0     23     49
2         Doc     48     32.0     36     37
3      Grumpy     40     23.5     40     45
4     Bashful     46     42.5     46     31
5      Sneezy     48     38.5     48     43
6       Dopey     38     45.0     39     32
7  Snow White     41     48.0     39     41 



Student,quiz1,midterm,quiz2,final
Sleepy,43,34.0,34,35
Happy,33,20.0,23,49
Doc,48,32.0,36,37
Grumpy,40,23.5,40,45
Bashful,46,42.5,46,31
Sneezy,48,38.5,48,43
Dopey,38,45.0,39,32
Snow White,41,48.0,39,41


In [64]:
nameIndex.loc["Doc"]

quiz1      48.0
midterm    32.0
quiz2      36.0
final      37.0
Name: Doc, dtype: float64

In [65]:
gb = nameIndex.reset_index()
gb

,Student,quiz1,midterm,quiz2,final
0,Sleepy,43,34.0,34,35
1,Happy,33,20.0,23,49
2,Doc,48,32.0,36,37
3,Grumpy,40,23.5,40,45
4,Bashful,46,42.5,46,31
5,Sneezy,48,38.5,48,43
6,Dopey,38,45.0,39,32
7,Snow White,41,48.0,39,41


In [66]:
gb.loc[2]

Student     Doc
quiz1        48
midterm    32.0
quiz2        36
final        37
Name: 2, dtype: object

In [67]:
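# 13. From the gradebook in the cell above:
# print Dopey's quiz1 and quiz2 values?
# gradebook still has the default integer row index, so gradebook.loc['Dopey', ...]
# raises KeyError: 'Dopey'. Use nameIndex, which has the student names as the row index.

print(nameIndex.loc['Dopey',['quiz1', 'quiz2']])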

Show __attributes__ of the DataFrame

In [68]:
# 14. We've already seen the column labels and row indices
print("index:")
print(gradebook.index)          
print(gradebook.index.values)
print()
print("labels:")
print(gradebook.columns)  
print(gradebook.columns.values)
print()
print("shape:", gradebook.shape)
print("size:", gradebook.size)
print("len:", len(gradebook))
print()
print("first part:", gradebook.head(), '\n')
print("last part:", gradebook.tail(3))

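# what's the difference between no input argument for head() or tail()
# and having a number as input argument?
## With no input argument, head() and tail() return 5 rows;
## with a number n as input argument, they return n rows.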

index:
RangeIndex(start=0, stop=8, step=1)
[0 1 2 3 4 5 6 7]

labels:
Index(['Student', 'quiz1', 'midterm', 'quiz2', 'final'], dtype='object')
['Student' 'quiz1' 'midterm' 'quiz2' 'final']

shape: (8, 5)
size: 40
len: 8

first part:    Student  quiz1  midterm  quiz2  final
0   Sleepy     43     34.0     34     35
1    Happy     33     20.0     23     49
2      Doc     48     32.0     36     37
3   Grumpy     40     23.5     40     45
4  Bashful     46     42.5     46     31 

last part:       Student  quiz1  midterm  quiz2  final
5      Sneezy     48     38.5     48     43
6       Dopey     38     45.0     39     32
7  Snow White     41     48.0     39     41


---

### Data Analysis

Basic __statistics__

In [69]:
# 15. We can get all the basic stats in one method
gradebook.describe()

# Review statistics and data analysis:            ** EC **
# You are the teacher for this class, and as a good teacher, you want to improve your 
# class material.
# Run the cell so you can see the statistics for the exams: quiz1, midterm, quiz2, final
# Using the statistics, which exam's class material do you need to improve?
# Explain your choice by citing specific statistic values that show the
# need to improve.
## I will need to improve the class material for the midterm. With every exam out of 50 points, the midterm has the
## lowest mean (35.44, about 71%) and the lowest minimum (20.0), and its standard deviation (10.02) is the largest.

,quiz1,midterm,quiz2,final
count,8.0,8.0,8.0,8.0
mean,42.125,35.4375,38.125,39.125
std,5.221863,10.022965,7.698562,6.424006
min,33.0,20.0,23.0,31.0
25%,39.5,29.875,35.5,34.25
50%,42.0,36.25,39.0,39.0
75%,46.5,43.125,41.5,43.5
max,48.0,48.0,48.0,49.0


In [70]:
# 16. To get a specific statistic for a specific column, we use numpy

print(np.median(gradebook.quiz1))
print(np.mean(gradebook.quiz2), '\n')

# or pandas
print(gradebook.quiz2.mean(), '\n')

# We can also get all statistics of one column
gradebook.quiz2.describe()

42.0
38.125 

38.125 



count     8.000000
mean     38.125000
std       7.698562
min      23.000000
25%      35.500000
50%      39.000000
75%      41.500000
max      48.000000
Name: quiz2, dtype: float64

In [71]:
# 17. For data analysis purposes, all scores in the examples below are out of 50 pts.

# Show students who earned more than 90% in their final
print(gradebook[gradebook.final > 50*.9], '\n')

# Show students who earned more than 80% in their final
print(gradebook[gradebook.final > 50*.8], '\n')

# Show the number of students who earned more than 80% in their final?


  Student  quiz1  midterm  quiz2  final
1   Happy     33     20.0     23     49 

      Student  quiz1  midterm  quiz2  final
1       Happy     33     20.0     23     49
3      Grumpy     40     23.5     40     45
5      Sneezy     48     38.5     48     43
7  Snow White     41     48.0     39     41 
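
One possible answer to the counting question in the cell above, sketched with the final scores from the Excel gradebook typed in by hand so the block runs on its own. A boolean mask can be summed because True counts as 1.

```python
import pandas as pd

gradebook = pd.DataFrame({
    "Student": ["Sleepy", "Happy", "Doc", "Grumpy",
                "Bashful", "Sneezy", "Dopey", "Snow White"],
    "final": [35, 49, 37, 45, 31, 43, 32, 41],
})

above_80 = gradebook.final > 50 * .8    # True for each student above 40 points
count = int(above_80.sum())             # number of True values
print("students above 80%:", count)
```

`len(gradebook[above_80])` gives the same count by measuring the filtered DataFrame.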



Basic __Calculations__

In [72]:
# 18. Assume the midterm and final are each worth 30% of the grade, and quiz1 and quiz2 
# are each worth 20% of the grade. 
# This means 60% of the grade comes from the midterm and final, and 40% of the grade 
# comes from the quizzes.

# We want to calculate the weighted average of the exams. 
# and we want the score to be out of 100 to make it easier to see the percentage.

wtAvg=(.2 * gradebook.quiz1 + .2 * gradebook.quiz2 + 
       .3 * gradebook.midterm + .3 * gradebook.final)
print("weighted average:")
print(wtAvg)

# For each student, show the wtAvg above as a percentage?           ** EC **
# Recall that the scores are out of 50, so someone with a weighted average
# of 25 would be at 50%


weighted average:
0    36.10
1    31.90
2    37.50
3    36.55
4    40.45
5    43.65
6    38.50
7    42.70
dtype: float64
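
A possible way to answer the percentage question, sketched on the first two students only. Since every exam is out of 50 and the weights add up to 1, the weighted average is also out of 50, so dividing by 50 and multiplying by 100 gives the percentage.

```python
import pandas as pd

gradebook = pd.DataFrame({
    "Student": ["Sleepy", "Happy"],
    "quiz1": [43, 33], "midterm": [34.0, 20.0],
    "quiz2": [34, 23], "final": [35, 49],
})

wtAvg = (.2 * gradebook.quiz1 + .2 * gradebook.quiz2 +
         .3 * gradebook.midterm + .3 * gradebook.final)

pct = wtAvg / 50 * 100       # the whole column is converted in one operation
print(pct.round(1))
```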


**Sorting**

In [None]:
# 19. Sort by a column

print(gradebook, '\n')
print(gradebook.sort_values(by="quiz1"), '\n')
print(gradebook.sort_values(by="quiz1", ascending=False), '\n')
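
A short sketch of two details of `sort_values()`: it returns a new DataFrame and leaves the original unchanged, and `by` can be a list of columns, with later columns breaking ties. The three rows come from the Excel gradebook.

```python
import pandas as pd

gb = pd.DataFrame({"Student": ["Sneezy", "Bashful", "Doc"],
                   "quiz1": [48, 46, 48],
                   "final": [43, 31, 37]})

# Highest quiz1 first; ties on quiz1 are ordered by lowest final first
ranked = gb.sort_values(by=["quiz1", "final"], ascending=[False, True])

print(ranked.Student.tolist())
print(gb.Student.tolist())      # original order is untouched
```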

**Changing shape**

In [None]:
# 20. Remove rows
gradebook = pd.read_excel("scores.xlsx", index_col='Student')
print(gradebook,'\n')

print("remove rows:")
print(gradebook.drop(["Sneezy","Happy"]),'\n')
print(gradebook,'\n')

gradebook.drop(["Sneezy","Happy"], inplace=True)
print(gradebook,'\n')

In [None]:
# Remove columns
gradebook.drop(columns=['quiz2'], inplace=True)
gradebook

In [None]:
# 21. Adding columns from another DataFrame (concatenating along axis=1)

gradebook = pd.read_excel("scores.xlsx", index_col='Student')
print(gradebook, "\n")
stInfo = pd.read_excel("ids.xlsx", index_col='Student')
print(stInfo, "\n")

print("Concatenating:")
data = pd.concat([stInfo, gradebook], axis=1)
data
#data = pd.concat([stInfo, gradebook])  # axis=0
#data

# with axis=1, rows are matched by row index; a row index that appears in only one
# DataFrame gets NaN in the columns that come from the other DataFrame
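
A sketch of how `pd.concat` with `axis=1` lines up rows by their index. The `id` values here are invented for illustration and are not the contents of `ids.xlsx`.

```python
import pandas as pd

scores = pd.DataFrame({"quiz1": [43, 33]}, index=["Sleepy", "Happy"])
ids = pd.DataFrame({"id": [101, 102]}, index=["Happy", "Doc"])   # hypothetical ids

# Default (outer) join: keep every row index, fill the gaps with NaN
combined = pd.concat([ids, scores], axis=1)
print(combined)

# Inner join: keep only the row indices found in both DataFrames
both = pd.concat([ids, scores], axis=1, join="inner")
print(both)
```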

In [None]:
# 22. Adding rows from another DataFrame

gradebook = pd.read_excel("scores.xlsx")
print(gradebook, "\n")
newrow = pd.DataFrame(columns=['Student','quiz1','midterm','quiz2','final'],
                      data=[ ["New Kid",30,30,30,30] ])
print(newrow, "\n")

print("Appending:")
# DataFrame.append() was removed in pandas 2.0; pd.concat() adds the rows instead
gradebook = pd.concat([gradebook, newrow], ignore_index=True)
gradebook

In [None]:
# append a dictionary

#gradebook = pd.read_excel("scores.xlsx")
d = dict(zip(['Student','quiz1','midterm','quiz2','final'],["New Kid2",30,30,30,30]))
print(d)
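# DataFrame.append() was removed in pandas 2.0; wrap the dictionary in a one-row DataFrame
gradebook = pd.concat([gradebook, pd.DataFrame([d])], ignore_index=True)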
gradebook

__groupby__ for data aggregation

In [None]:
# 23. groupby can be used to group data together when there are specific 
# categories in a column

print(data, "\n")
print("groupby object:", data.groupby("year"), "\n")
print(data.groupby("year").mean(), '\n')

# The above output shows the mean of the id's, which doesn't make sense.
# Show the mean of the exams only?
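
A possible approach to the question above: select the exam columns on the groupby object before calling `mean()`. The `id` and `year` values below are made up, since the contents of `ids.xlsx` are not shown in this notebook.

```python
import pandas as pd

data = pd.DataFrame({
    "id": [11, 12, 13, 14],          # hypothetical student ids
    "year": [1, 2, 1, 2],            # hypothetical class year
    "quiz1": [43, 33, 48, 40],
    "final": [35, 49, 37, 45],
}, index=["Sleepy", "Happy", "Doc", "Grumpy"])

# A list of column labels restricts the aggregation to those columns
exam_means = data.groupby("year")[["quiz1", "final"]].mean()
print(exam_means)
```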


### Data Cleaning

Missing data or __NaN__

In [73]:
# 24. When data is read in to a DataFrame and some values are missing, the missing values 
# appear as NaN values in the DataFrame. NaN (Not a Number) is a special floating-point value defined by IEEE 754.
data = pd.read_csv("classes.csv")   # empty field in CSV file
print("original data:")
print(data, '\n')

# remove data records (rows) with NaN
cleanedData = data.dropna()
print("drop NaN:")
print(cleanedData, '\n')

# replace NaN with some default value
subbedData = data.fillna(0)
print("Replace NaN:")
print(subbedData, '\n')

# check for NaN in the DataFrame
print("check for NaN:")
print(data[data.isna().any(axis=1)], "\n")
print(data.isna().sum())

original data:
    Class Days     Time  Number of Units  Number of Students Location
0    CIS3   MW   9:30am              4.0                45.0  De Anza
1  CIS22A   MW  11:30am              4.5                 NaN  De Anza
2  CIS41A  TTH   9:30am              4.5                47.0  De Anza
3  CIS18B   MW   1:30pm              4.5                 NaN  De Anza 

drop NaN:
    Class Days    Time  Number of Units  Number of Students Location
0    CIS3   MW  9:30am              4.0                45.0  De Anza
2  CIS41A  TTH  9:30am              4.5                47.0  De Anza 

Replace NaN:
    Class Days     Time  Number of Units  Number of Students Location
0    CIS3   MW   9:30am              4.0                45.0  De Anza
1  CIS22A   MW  11:30am              4.5                 0.0  De Anza
2  CIS41A  TTH   9:30am              4.5                47.0  De Anza
3  CIS18B   MW   1:30pm              4.5                 0.0  De Anza 

check for NaN:
    Class Days     Time  Number of

In [74]:
# 25. NaN with numpy
print("numpy:")
print(np.median(data['Number of Students']))
print(np.median(cleanedData['Number of Students']), '\n')

# NaN with pandas
print("pandas:")
print(data['Number of Students'].median())

numpy:
nan
46.0 

pandas:
46.0
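
The difference shown above comes from how each library treats NaN: `np.median` lets NaN propagate into the result, while the pandas method skips NaN by default. numpy also has NaN-aware variants such as `np.nanmedian`. A small sketch using the same four "Number of Students" values:

```python
import numpy as np
import pandas as pd

students = pd.Series([45.0, np.nan, 47.0, np.nan])

plain = np.median(students.to_numpy())        # NaN propagates into the result
skipping = np.nanmedian(students.to_numpy())  # NaN values are ignored
pandas_median = students.median()             # pandas skips NaN by default

print(plain, skipping, pandas_median)
```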


Change column labels: __string vectorization__

In [75]:
# 26. As seen from the cell above, it's more convenient to have a shorter column label.
# Simplify the data.columns (column labels) so it's easier to type.
# a. change the column labels so they're all lowercase

data.columns = data.columns.str.lower()
data

,class,days,time,number of units,number of students,location
0,CIS3,MW,9:30am,4.0,45.0,De Anza
1,CIS22A,MW,11:30am,4.5,,De Anza
2,CIS41A,TTH,9:30am,4.5,47.0,De Anza
3,CIS18B,MW,1:30pm,4.5,,De Anza


In [76]:
# b. change column labels to 1 word: class, days, time, units, students ?
# You'll need to do the reading for this answer

data.columns = data.columns.str.extract('([a-z]+)$',expand=False)
data

,class,days,time,units,students,location
0,CIS3,MW,9:30am,4.0,45.0,De Anza
1,CIS22A,MW,11:30am,4.5,,De Anza
2,CIS41A,TTH,9:30am,4.5,47.0,De Anza
3,CIS18B,MW,1:30pm,4.5,,De Anza
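
A sketch of what the regular expression in the cell above does, applied to a standalone Index. The pattern `([a-z]+)$` captures the last run of lowercase letters in each label, which is the last word. `rename()` with a dictionary is shown as an alternative that needs no regular expression.

```python
import pandas as pd

labels = pd.Index(["class", "number of units", "number of students"])

# The capture group keeps only the final word of each label
short = labels.str.extract(r"([a-z]+)$", expand=False)
print(short.tolist())

# Alternative: map old labels to new labels explicitly
df = pd.DataFrame(columns=labels)
renamed = df.rename(columns={"number of units": "units",
                             "number of students": "students"})
print(renamed.columns.tolist())
```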


Remove unnecessary columns

In [79]:
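# 27. One of the columns doesn't really give us any info about the classes.    ** EC **
# Which column is it?
## location
# Write code to remove this column ?
# data.pop("location") also works, but only once: it removes the column in place,
# so running the cell again raises KeyError: 'location'. drop() with errors="ignore"
# can be re-run safely.
data = data.drop(columns="location", errors="ignore")
data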

Convert a DataFrame to a numpy array

In [None]:
# 28. A DataFrame can be converted to a numpy array
# This is most useful when all the data are numbers

gradebook = pd.read_csv("scores.csv", header=0, names=["q1","midt","q2","final"])
print(gradebook, '\n')

arr = gradebook.values
print(type(arr))
arr
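
A sketch of why the conversion works best on numeric data, using `to_numpy()`, which the pandas documentation recommends over `.values`. With only numeric columns the array gets a numeric dtype; once a text column is included, every element becomes a generic Python object.

```python
import pandas as pd

gb = pd.DataFrame({"q1": [43, 33], "midt": [34.0, 18.0]},
                  index=["Sleepy", "Happy"])

numeric = gb.to_numpy()                 # ints are upcast to match the floats
print(numeric.dtype)

mixed = gb.reset_index().to_numpy()     # the names column forces dtype object
print(mixed.dtype)
```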

Replace data in a column

In [None]:
# 29. To replace data in a column, create a dictionary
# of old_data:new_data for the key:value 

gradebook = pd.read_csv("scores.csv")
print(gradebook, '\n')

Student = {'Sleepy':11,'Happy':12,'Doc':13,'Grumpy':14,
           'Bashful':15,'Sneezy':16,'Dopey':17,'Snow White':18 }
gradebook.replace(Student, inplace=True)
gradebook
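
`DataFrame.replace()` as used above looks for the dictionary keys in every column. A hedged alternative, when only one column should change, is to apply the dictionary to that column with `Series.map()`. Note that `map()` turns any value missing from the dictionary into NaN. The sketch uses two rows of the gradebook.

```python
import pandas as pd

gb = pd.DataFrame({"Student": ["Sleepy", "Happy"], "quiz1": [43, 33]})
student_ids = {"Sleepy": 11, "Happy": 12}

# Only the Student column is looked up in the dictionary
gb["Student"] = gb["Student"].map(student_ids)
print(gb)
```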