## <p align="center" style="font-family: monospace, monaco;"> Chapter 5 - Data Analysis with Pandas </p>
### <p align="center" style="font-family: monospace, monaco;"> Jupyter Notebook </p>


### <p style="font-family: monospace, monaco;">DataFrame and Series </p>
- <p style="font-family: monospace, monaco;">a <em>DataFrame</em>, while similar to a two-dimensional NumPy array, comes with column and row labels, and each column can hold a different data type</p>
- <p style="font-family: monospace, monaco;">when you extract a single column or row from a DataFrame, you get a <em>Series</em></p>
- <p style="font-family: monospace, monaco;">a Series is like a one-dimensional NumPy array, but it has labels</p>
<img src="media/DataFrame-Series.png" alt="dataframe and series" style="width: 500px; height: 300px;">
- <p style="font-family: monospace, monaco;">Here's an example of how easy it is to load an Excel file into a DataFrame</p>



In [3]:

import pandas as pd

pd.read_excel("datasets/course_participants.xlsx")

Unnamed: 0,user_id,name,age,country,score,continent
0,1001,Mark,55,Italy,4.5,Europe
1,1000,John,33,USA,6.7,America
2,1002,Tim,41,USA,3.9,America
3,1003,Jenny,12,Germany,9.0,Europe


- <p style="font-family: monospace, monaco;">The DataFrame is formatted as an HTML table</p>
- <p style="font-family: monospace, monaco;">you can create a DataFrame like this from scratch by passing a nested list as the data and setting values for the <code>columns</code> and <code>index</code></p>

In [8]:
data = [["Arlo", 15, "Wakefield", 4.5, "Arlington"],
        ["Olive", 16, "Arlington Tech", 9.9, "Arlington"],
        ["Ally", 16, "H-B Woodlawn", 8.7, "Arlington"]]
df = pd.DataFrame(data=data, columns=["Name", "Age", "School", "Score", "County"], index=[1,2,3])

df

Unnamed: 0,Name,Age,School,Score,County
1,Arlo,15,Wakefield,4.5,Arlington
2,Olive,16,Arlington Tech,9.9,Arlington
3,Ally,16,H-B Woodlawn,8.7,Arlington


- <p style="font-family: monospace, monaco;">in <code>data=data</code>, the <code>data</code> on the left is the name of the function's parameter, and the <code>data</code> on the right is the name of the variable you pass in as the argument</p>
- <p style="font-family: monospace, monaco;">if your variable were named <code>source_data</code> instead, the call to the DataFrame class would read <code>data=source_data</code></p>
- <p style="font-family: monospace, monaco;">calling the <code>info</code> method gives us some extra information, like the number of data points and the data type of each column</p>

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 1 to 3
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    3 non-null      object 
 1   Age     3 non-null      int64  
 2   School  3 non-null      object 
 3   Score   3 non-null      float64
 4   County  3 non-null      object 
dtypes: float64(1), int64(1), object(3)
memory usage: 144.0+ bytes
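
<p style="font-family: monospace, monaco;">The earlier point that a single column or row comes back as a Series can be checked with a minimal sketch. It rebuilds the sample DataFrame from above; the variable names <code>column</code> and <code>row</code> are only illustrative.</p>

```python
import pandas as pd

data = [["Arlo", 15, "Wakefield", 4.5, "Arlington"],
        ["Olive", 16, "Arlington Tech", 9.9, "Arlington"],
        ["Ally", 16, "H-B Woodlawn", 8.7, "Arlington"]]
df = pd.DataFrame(data=data,
                  columns=["Name", "Age", "School", "Score", "County"],
                  index=[1, 2, 3])

column = df["School"]  # a single column is a Series labeled by the row index
row = df.loc[2]        # a single row is a Series labeled by the column names

print(type(column).__name__)  # Series
print(type(row).__name__)     # Series
print(list(column.index))     # [1, 2, 3]
print(list(row.index))        # ['Name', 'Age', 'School', 'Score', 'County']
```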


#### <p style="font-family: monospace, monaco; font-weight: strong;">Index</p>
- <p style="font-family: monospace, monaco;">The row labels of a DataFrame are called the <em>index</em></p>
- <p style="font-family: monospace, monaco;">If you don't have a useful index, don't add one; pandas will create one automatically</p>
- <p style="font-family: monospace, monaco;">An index allows pandas to look up data faster and is essential for many operations</p>
- <p style="font-family: monospace, monaco;">You can access the index object like this:</p>

In [11]:
df.index

Index([1, 2, 3], dtype='int64')

- <p style="font-family: monospace, monaco;">when it makes sense to, add a name to the index</p>

In [12]:
df.index.name = "user_id"

df

Unnamed: 0_level_0,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Arlo,15,Wakefield,4.5,Arlington
2,Olive,16,Arlington Tech,9.9,Arlington
3,Ally,16,H-B Woodlawn,8.7,Arlington


- <p style="font-family: monospace, monaco;">An index can contain duplicates, but this makes looking up values slower</p>
- <p style="font-family: monospace, monaco;">to turn an index into a regular column use <code>reset_index()</code></p>
- <p style="font-family: monospace, monaco;">to set a new index use <code>set_index()</code></p>
- <p style="font-family: monospace, monaco;">if you don't want to lose your existing index when setting a new one, reset it first</p>

In [14]:
# reset_index, turns user_id into a normal column leaving index filled automatically by pandas
# set_index, sets "Name" column to the new index

df.reset_index().set_index("Name")

Unnamed: 0_level_0,user_id,Age,School,Score,County
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Arlo,1,15,Wakefield,4.5,Arlington
Olive,2,16,Arlington Tech,9.9,Arlington
Ally,3,16,H-B Woodlawn,8.7,Arlington


- <p style="font-family: monospace, monaco;">doing <code>df.reset_index().set_index("Name")</code> is using <em>method chaining</em></p>
- <p style="font-family: monospace, monaco;">because <code>reset_index</code> returns a DataFrame you can directly call another DataFrame method</p>
- <p style="font-family: monospace, monaco;">calling a method in the form <code>df.method_name()</code>, as with the methods shown here, returns a copy of the DataFrame with the method applied; the original DataFrame is untouched unless you assign the result back with <code>df = df.method_name()</code></p>
- <p style="font-family: monospace, monaco;">to change the index use the <code>reindex</code> method</p>

In [16]:
df.reindex([3,6,2,12])


Unnamed: 0_level_0,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
3,Ally,16.0,H-B Woodlawn,8.7,Arlington
6,,,,,
2,Olive,16.0,Arlington Tech,9.9,Arlington
12,,,,,


- <p style="font-family: monospace, monaco;"><code>reindex</code> keeps every existing row whose label matches and adds a row of missing values (<code>NaN</code>) for each new label</p>
- <p style="font-family: monospace, monaco;">index labels that are left out are dropped</p>
- <p style="font-family: monospace, monaco;">to sort an index use <code>sort_index</code></p>

In [17]:
df.sort_index()

Unnamed: 0_level_0,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Arlo,15,Wakefield,4.5,Arlington
2,Olive,16,Arlington Tech,9.9,Arlington
3,Ally,16,H-B Woodlawn,8.7,Arlington


- <p style="font-family: monospace, monaco;">you can sort the rows by one or more columns by using <code>sort_values</code></p>

In [19]:
df.sort_values(["Age","Score"])

Unnamed: 0_level_0,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Arlo,15,Wakefield,4.5,Arlington
3,Ally,16,H-B Woodlawn,8.7,Arlington
2,Olive,16,Arlington Tech,9.9,Arlington


- <p style="font-family: monospace, monaco;">this sorts by age then score</p>
- <p style="font-family: monospace, monaco;">to sort by a single column, provide the column name as a string</p>
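
<p style="font-family: monospace, monaco;">A minimal sketch of sorting by a single column, using the sample DataFrame from above. The <code>ascending=False</code> argument does not appear in these notes; it is included only to show the reverse order.</p>

```python
import pandas as pd

data = [["Arlo", 15, "Wakefield", 4.5, "Arlington"],
        ["Olive", 16, "Arlington Tech", 9.9, "Arlington"],
        ["Ally", 16, "H-B Woodlawn", 8.7, "Arlington"]]
df = pd.DataFrame(data=data,
                  columns=["Name", "Age", "School", "Score", "County"],
                  index=[1, 2, 3])

# a single column name is passed as a plain string
by_score = df.sort_values("Score")

# highest score first
by_score_desc = df.sort_values("Score", ascending=False)

print(by_score["Name"].tolist())       # ['Arlo', 'Ally', 'Olive']
print(by_score_desc["Name"].tolist())  # ['Olive', 'Ally', 'Arlo']
print(df.index.tolist())               # [1, 2, 3], the original is unchanged
```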

#### <p style="font-family: monospace, monaco;">Columns</p>

- <p style="font-family: monospace, monaco;">to get information about the columns within the DataFrame use <code>df.columns</code></p>


In [20]:
df.columns

Index(['Name', 'Age', 'School', 'Score', 'County'], dtype='object')

- <p style="font-family: monospace, monaco;">if column names are not provided when creating the DataFrame, pandas automatically uses integers starting at zero as the names</p>
- <p style="font-family: monospace, monaco;">you assign a name to the column headers the same way as for the index</p>

In [24]:
df.columns.name="Properties"

df

Properties,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Arlo,15,Wakefield,4.5,Arlington
2,Olive,16,Arlington Tech,9.9,Arlington
3,Ally,16,H-B Woodlawn,8.7,Arlington
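
<p style="font-family: monospace, monaco;">The default integer labels mentioned above can be seen with a small sketch that builds the same nested list without <code>columns</code> or <code>index</code>. The name <code>unlabeled</code> is only illustrative.</p>

```python
import pandas as pd

data = [["Arlo", 15, "Wakefield", 4.5, "Arlington"],
        ["Olive", 16, "Arlington Tech", 9.9, "Arlington"],
        ["Ally", 16, "H-B Woodlawn", 8.7, "Arlington"]]

# no columns= and no index=, so pandas fills in integers starting at zero
unlabeled = pd.DataFrame(data=data)

print(list(unlabeled.columns))  # [0, 1, 2, 3, 4]
print(list(unlabeled.index))    # [0, 1, 2]
```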


- <p style="font-family: monospace, monaco;"> you can rename columns</p>

In [25]:
df.rename(columns={"Name":"First Name", "Score":"Final Score"})

Properties,First Name,Age,School,Final Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Arlo,15,Wakefield,4.5,Arlington
2,Olive,16,Arlington Tech,9.9,Arlington
3,Ally,16,H-B Woodlawn,8.7,Arlington


- <p style="font-family: monospace, monaco;">to delete columns and rows, use <code>drop</code></p>

In [28]:
df.drop(columns=["Name", "County"], index=[1])

Properties,Age,School,Score
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2,16,Arlington Tech,9.9
3,16,H-B Woodlawn,8.7


- <p style="font-family: monospace, monaco;">Columns and rows of a DataFrame are all represented by <code>Index</code> objects, meaning you can change your columns into rows and vice versa by transposing the DataFrame</p>

In [29]:
df.T  # shortcut for df.transpose()

user_id,1,2,3
Properties,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Name,Arlo,Olive,Ally
Age,15,16,16
School,Wakefield,Arlington Tech,H-B Woodlawn
Score,4.5,9.9,8.7
County,Arlington,Arlington,Arlington


- <p style="font-family: monospace, monaco;">You can reorder the columns of a DataFrame by selecting them in the desired order or by using <code>reindex</code></p>

In [30]:
df.loc[:, ["Score", "Age", "School", "County", "Name"]].drop(columns=["County"]) # .drop is just extra, reordering is df.loc

Properties,Score,Age,School,Name
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,4.5,15,Wakefield,Arlo
2,9.9,16,Arlington Tech,Olive
3,8.7,16,H-B Woodlawn,Ally
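
<p style="font-family: monospace, monaco;">The <code>reindex</code> alternative mentioned above could look like the following sketch. It assumes the sample DataFrame from earlier and passes the desired order through the <code>columns</code> parameter; <code>new_order</code> is an illustrative name.</p>

```python
import pandas as pd

data = [["Arlo", 15, "Wakefield", 4.5, "Arlington"],
        ["Olive", 16, "Arlington Tech", 9.9, "Arlington"],
        ["Ally", 16, "H-B Woodlawn", 8.7, "Arlington"]]
df = pd.DataFrame(data=data,
                  columns=["Name", "Age", "School", "Score", "County"],
                  index=[1, 2, 3])

new_order = ["Score", "Age", "School", "County", "Name"]

# reorder by reindexing the column labels
reordered = df.reindex(columns=new_order)

# the same result by selecting the columns in the desired order
selected = df.loc[:, new_order]

print(list(reordered.columns))  # ['Score', 'Age', 'School', 'County', 'Name']
```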


### <p style="font-family: monospace, monaco;">Data Manipulation</p>
#### <p style="font-family: monospace, monaco;"><strong>Selecting Data</strong></p>
##### <p style="font-family: monospace, monaco;">Selecting by Label</p>
- <p style="font-family: monospace, monaco;">You can use <code>loc</code>, which means <em>location</em>, to specify which rows and columns to retrieve: <code>df.loc[rows_selection, column_selection]</code></p>

- <p style="font-family: monospace, monaco;"><code>loc</code> supports slice notation, which means you can use a colon to select all rows or all columns, respectively</p>
- <p style="font-family: monospace, monaco;">You can also provide a single column or row name</p>
<p style="font-family: monospace, monaco;"> Data selection by label &rarr;</p> 

| selection | Return Data Type | Example |
|-----------|------------------|---------|
| Single Value| Scalar| `df.loc[1000, "country"]`|
| One column (1d)| Series | `df.loc[:, "country"]`|
|One column (2d)| DataFrame| `df.loc[:, ["country"]]`|
|Multiple columns|DataFrame|`df.loc[:, ["country", "age"]]`|
|Range of columns|DataFrame|`df.loc[:, "name":"country"]`|
|One row (1d)|Series|`df.loc[1000, :]`|
|One row (2d)|DataFrame|`df.loc[[1000], :]`|
|Multiple rows|DataFrame|`df.loc[[1003, 1000], :]`|
|Range of rows|DataFrame|`df.loc[1000:1002, :]`|

- <p style="font-family: monospace, monaco;">using <code>loc</code> to select scalars, series, and DataFrames</p>

In [33]:
# using scalars for both row and column returns a scalar

df.loc[2, "Name"]

'Olive'

In [34]:
# using a scalar for either the row or the column selection returns a Series

df.loc[[2,3], "Name"]

user_id
2    Olive
3     Ally
Name: Name, dtype: object

##### <p style="font-family: monospace, monaco;"> Selecting by position </p>
- <p style="font-family: monospace, monaco;">to select a subset of a DataFrame by position, use <code>iloc</code>, which stands for <em>integer location</em>: <code>df.iloc[row_selection, column_selection]</code></p>

| Selection | Return Data Type | Example |
|---------|------------------|---------|
|Single value|Scalar|`df.iloc[1, 2]`|
|One column (1d)|Series|`df.iloc[:, 2]`|
|One column (2d)|DataFrame|`df.iloc[:, [2]]`|
|Multiple columns|DataFrame|`df.iloc[:, [2, 1]]`|
|Range of columns|DataFrame|`df.iloc[:, :3]`|
|One row (1d)|Series|`df.iloc[1, :]`|
|One row (2d)|DataFrame|`df.iloc[[1], :]`|
|Multiple rows|DataFrame|`df.iloc[[3, 1], :]`|
|Range of rows|DataFrame|`df.iloc[1:3, :]`|

- <p style="font-family: monospace, monaco;">How to use <code>iloc</code></p>

In [36]:
# returns scalar

df.iloc[0,0]

'Arlo'

In [37]:
# returns a series

df.iloc[[0,1], 2]

user_id
1         Wakefield
2    Arlington Tech
Name: School, dtype: object

In [39]:
#this returns a DataFrame

df.iloc[:2, [0,2]]

Properties,Name,School
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Arlo,Wakefield
2,Olive,Arlington Tech
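
<p style="font-family: monospace, monaco;">One difference between the two tables is worth a short sketch: a slice with <code>loc</code> includes its end label, while a slice with <code>iloc</code> excludes its end position, as in regular Python slicing. The example uses the sample DataFrame from above; the variable names are illustrative.</p>

```python
import pandas as pd

data = [["Arlo", 15, "Wakefield", 4.5, "Arlington"],
        ["Olive", 16, "Arlington Tech", 9.9, "Arlington"],
        ["Ally", 16, "H-B Woodlawn", 8.7, "Arlington"]]
df = pd.DataFrame(data=data,
                  columns=["Name", "Age", "School", "Score", "County"],
                  index=[1, 2, 3])

# label slice: rows labeled 1 and 2 are both included
by_label = df.loc[1:2, "Name"]

# position slice: only position 1 is returned, position 2 is excluded
by_position = df.iloc[1:2, 0]

print(by_label.tolist())     # ['Arlo', 'Olive']
print(by_position.tolist())  # ['Olive']
```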


##### <p style="font-family: monospace, monaco;">Selecting by boolean indexing</p>
- <p style="font-family: monospace, monaco;">boolean series are used to select specific columns and rows in a DataFrame</p>
- <p style="font-family: monospace, monaco;">boolean DataFrames are used to select certain values across the entire DataFrame</p>

In [40]:
#this is a series with only true/false

tf = (df["Age"] == 16) & (df["Score"] > 5.0)

tf

user_id
1    False
2     True
3     True
dtype: bool

In [41]:
df.loc[tf, :]

Properties,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2,Olive,16,Arlington Tech,9.9,Arlington
3,Ally,16,H-B Woodlawn,8.7,Arlington


- <p style="font-family: monospace, monaco;">with DataFrames and Series you can't use Python's <code>and</code>, <code>or</code>, and <code>not</code>; use these operators instead &rarr;</p>

| Basic Python Data Types | DataFrames and Series|
|-------------------------|----------------------|
|`and`|`&`|
|`or`|`\|`|
|`not`|`~`|

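- <p style="font-family: monospace, monaco;">put every boolean expression in parentheses so that operator precedence doesn't get in the way: <code>&amp;</code> and <code>|</code> bind more tightly than comparisons such as <code>==</code></p>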
- <p style="font-family: monospace, monaco;">if you want to refer to index, use <code>df.index</code></p>

In [42]:
df.loc[df.index>1]

Properties,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2,Olive,16,Arlington Tech,9.9,Arlington
3,Ally,16,H-B Woodlawn,8.7,Arlington


- <p style="font-family: monospace, monaco;">you can use <code>isin</code> to filter rows by a list of values</p>

In [44]:
df.loc[df["School"].isin(["Wakefield", "Arlington Tech"])]

Properties,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Arlo,15,Wakefield,4.5,Arlington
2,Olive,16,Arlington Tech,9.9,Arlington
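
<p style="font-family: monospace, monaco;">The <code>~</code> operator from the table above has no example yet, so here is a minimal sketch that negates an <code>isin</code> filter and combines it with a parenthesized comparison. It uses the sample DataFrame from above; <code>in_listed</code>, <code>others</code>, and <code>combined</code> are illustrative names.</p>

```python
import pandas as pd

data = [["Arlo", 15, "Wakefield", 4.5, "Arlington"],
        ["Olive", 16, "Arlington Tech", 9.9, "Arlington"],
        ["Ally", 16, "H-B Woodlawn", 8.7, "Arlington"]]
df = pd.DataFrame(data=data,
                  columns=["Name", "Age", "School", "Score", "County"],
                  index=[1, 2, 3])

in_listed = df["School"].isin(["Wakefield", "Arlington Tech"])

# ~ flips every True/False, keeping the rows NOT in the list
others = df.loc[~in_listed, :]

# the comparison needs parentheses because | binds more tightly than ==
combined = df.loc[~in_listed | (df["Age"] == 15), :]

print(others["Name"].tolist())    # ['Ally']
print(combined["Name"].tolist())  # ['Arlo', 'Ally']
```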


- <p style="font-family: monospace, monaco;">DataFrames have a special syntax without <code>loc</code> to select values given a DataFrame of booleans: <code>df[boolean_df]</code></p>
- <p style="font-family: monospace, monaco;">This is helpful when you have a DataFrame full of numbers:</p>

In [46]:
rainfall = pd.DataFrame(data={"arlington":[306.3, 403.2],
                             "Ann Arbor": [607.9,539.1],
                             "Cambridge": [800.12, 300.6]})

rainfall

Unnamed: 0,arlington,Ann Arbor,Cambridge
0,306.3,607.9,800.12
1,403.2,539.1,300.6


In [47]:
rainfall < 550.3

Unnamed: 0,arlington,Ann Arbor,Cambridge
0,True,False,False
1,True,True,True


In [48]:
rainfall[rainfall < 550.3]

Unnamed: 0,arlington,Ann Arbor,Cambridge
0,306.3,,
1,403.2,539.1,300.6


#### <p style="font-family: monospace, monaco;"><strong>Setting data</strong></p>
- <p style="font-family: monospace, monaco;">the easiest way to change data in DataFrame is by assigning values to elements using <code>loc</code> or <code>iloc</code></p>
##### <p style="font-family: monospace, monaco;">Setting data by label or position</p>
- <p style="font-family: monospace, monaco;">assigning values with <code>loc</code> or <code>iloc</code> changes the original DataFrame, so the examples work on a copy</p>

In [50]:
# copies original dataframe to df2

df2 = df.copy()

In [51]:
df2.loc[4, "Name"] = "Zara"

df2

Properties,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Arlo,15.0,Wakefield,4.5,Arlington
2,Olive,16.0,Arlington Tech,9.9,Arlington
3,Ally,16.0,H-B Woodlawn,8.7,Arlington
4,Zara,,,,
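
<p style="font-family: monospace, monaco;">Setting a value by position works the same way with <code>iloc</code>. A minimal sketch on a copy of the sample DataFrame, where the new scores are made-up values for illustration:</p>

```python
import pandas as pd

data = [["Arlo", 15, "Wakefield", 4.5, "Arlington"],
        ["Olive", 16, "Arlington Tech", 9.9, "Arlington"],
        ["Ally", 16, "H-B Woodlawn", 8.7, "Arlington"]]
df = pd.DataFrame(data=data,
                  columns=["Name", "Age", "School", "Score", "County"],
                  index=[1, 2, 3])

df2 = df.copy()  # work on a copy so df keeps its original values

# by position: first row, fourth column ("Score")
df2.iloc[0, 3] = 5.0

# by label: row 3, column "Score"
df2.loc[3, "Score"] = 9.0

print(df2["Score"].tolist())  # [5.0, 9.9, 9.0]
print(df["Score"].tolist())   # [4.5, 9.9, 8.7]
```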


##### <p style="font-family: monospace, monaco;">Setting data by boolean indexing</p>
- <p style="font-family: monospace, monaco;">boolean indexing can be used to filter rows and assign values in a DataFrame</p>

In [53]:
tf = (df2["Age"] == 15) | (df2["Score"] > 8.7)

df2.loc[tf, "Name"] = "xxx"

df2

Properties,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,xxx,15.0,Wakefield,4.5,Arlington
2,xxx,16.0,Arlington Tech,9.9,Arlington
3,Ally,16.0,H-B Woodlawn,8.7,Arlington
4,Zara,,,,


- <p style="font-family: monospace, monaco;">when you have to replace certain values across the whole DataFrame, use the special syntax again and index with a DataFrame of booleans</p>

In [54]:
rainfall2 = rainfall.copy()

rainfall2

Unnamed: 0,arlington,Ann Arbor,Cambridge
0,306.3,607.9,800.12
1,403.2,539.1,300.6


In [58]:
# set values to 0 when they are below 450.3

rainfall2[rainfall2 < 450.3] = 0

rainfall2

Unnamed: 0,arlington,Ann Arbor,Cambridge
0,0.0,607.9,800.12
1,0.0,539.1,0.0


##### <p style="font-family: monospace, monaco;">Setting data by replacing values</p>
- <p style="font-family: monospace, monaco;">if you want to replace a certain value across the whole table, use <code>replace</code></p>

In [62]:
df2.replace("Arlington", "APSVA")

Properties,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,xxx,15.0,Wakefield,4.5,APSVA
2,xxx,16.0,Arlington Tech,9.9,APSVA
3,Ally,16.0,H-B Woodlawn,8.7,APSVA
4,Zara,,,,


- <p style="font-family: monospace, monaco;">to only act on a certain column use: </p>

In [67]:
df2.replace({"County" : {"Arlington": "APSVA"}})

Properties,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,xxx,15.0,Wakefield,4.5,APSVA
2,xxx,16.0,Arlington Tech,9.9,APSVA
3,Ally,16.0,H-B Woodlawn,8.7,APSVA
4,Zara,,,,


##### <p style="font-family: monospace, monaco;">Setting data by adding a new column</p>
- <p style="font-family: monospace, monaco;"> to add a new column to a DataFrame, assign values to a column name</p>
- <p style="font-family: monospace, monaco;">you can do this by using a scalar or a list:</p>

In [69]:
df2.loc[: , "Grade"] = ["Sophomore", "Junior", "Junior", "Sophomore"]

df2

Properties,Name,Age,School,Score,County,Grade
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,xxx,15.0,Wakefield,4.5,Arlington,Sophomore
2,xxx,16.0,Arlington Tech,9.9,Arlington,Junior
3,Ally,16.0,H-B Woodlawn,8.7,Arlington,Junior
4,Zara,,,,,Sophomore


- <p style="font-family: monospace, monaco;">or vectorized calculations:</p>

In [70]:
df2 = df.copy() #start with a fresh copy

df2.loc[: , "Birth Year"] = 2024 - df2["Age"]

df2

Properties,Name,Age,School,Score,County,Birth Year
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,Arlo,15,Wakefield,4.5,Arlington,2009
2,Olive,16,Arlington Tech,9.9,Arlington,2008
3,Ally,16,H-B Woodlawn,8.7,Arlington,2008
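
<p style="font-family: monospace, monaco;">Assigning a single scalar fills the whole new column with that value. A minimal sketch, where the column name <code>State</code> and its value are made up for illustration:</p>

```python
import pandas as pd

data = [["Arlo", 15, "Wakefield", 4.5, "Arlington"],
        ["Olive", 16, "Arlington Tech", 9.9, "Arlington"],
        ["Ally", 16, "H-B Woodlawn", 8.7, "Arlington"]]
df = pd.DataFrame(data=data,
                  columns=["Name", "Age", "School", "Score", "County"],
                  index=[1, 2, 3])

df2 = df.copy()  # start with a fresh copy

# the scalar is repeated for every row of the new column
df2.loc[:, "State"] = "Virginia"

print(df2["State"].tolist())  # ['Virginia', 'Virginia', 'Virginia']
```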


#### <p style="font-family: monospace, monaco;"><strong>Missing Data</strong></p>
- <p style="font-family: monospace, monaco;">pandas uses NumPy's <code>np.nan</code> for missing data</p>
- <p style="font-family: monospace, monaco;">missing data is displayed as <code>NaN</code></p>
- <p style="font-family: monospace, monaco;">for timestamps, <code>pd.NaT</code> is used</p>
- <p style="font-family: monospace, monaco;">for text, it is <code>None</code></p>

In [71]:
df2 = df.copy()

df2.loc[2 , "Score"] = None

df2.loc[4, : ] = None

df2

Properties,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Arlo,15.0,Wakefield,4.5,Arlington
2,Olive,16.0,Arlington Tech,,Arlington
3,Ally,16.0,H-B Woodlawn,8.7,Arlington
4,,,,,


- <p style="font-family: monospace, monaco;"> when cleaning a DataFrame, you may want to drop missing data, to do that use: </p>

In [72]:
df2.dropna()

Properties,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Arlo,15.0,Wakefield,4.5,Arlington
3,Ally,16.0,H-B Woodlawn,8.7,Arlington


- <p style="font-family: monospace, monaco;">to remove only the rows where every value is missing, use the <code>how</code> parameter</p>

In [73]:
df2.dropna(how="all")

Properties,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Arlo,15.0,Wakefield,4.5,Arlington
2,Olive,16.0,Arlington Tech,,Arlington
3,Ally,16.0,H-B Woodlawn,8.7,Arlington


- <p style="font-family: monospace, monaco;">to get a boolean DataFrame or Series that shows where values are <code>NaN</code>, use <code>isna</code></p>

In [74]:
df2.isna()

Properties,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,False,False,False,False,False
2,False,False,False,True,False
3,False,False,False,False,False
4,True,True,True,True,True


- <p style="font-family: monospace, monaco;">to fill missing values use <code>fillna</code></p>

In [75]:
df2.fillna({"Score": df2["Score"].mean()})

Properties,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Arlo,15.0,Wakefield,4.5,Arlington
2,Olive,16.0,Arlington Tech,6.6,Arlington
3,Ally,16.0,H-B Woodlawn,8.7,Arlington
4,,,,6.6,
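
<p style="font-family: monospace, monaco;">The three missing-value markers listed at the start of this section can be tried out in one small sketch. The DataFrame <code>events</code> and its column names are hypothetical; <code>isna</code> treats all three markers as missing.</p>

```python
import numpy as np
import pandas as pd

# hypothetical data: the second row is missing in every column
events = pd.DataFrame({
    "score": [4.5, np.nan],                          # numbers: np.nan
    "joined": [pd.Timestamp("2024-01-15"), pd.NaT],  # timestamps: pd.NaT
    "name": ["Arlo", None],                          # text: None
})

mask = events.isna()
print(mask.iloc[0].tolist())  # [False, False, False]
print(mask.iloc[1].tolist())  # [True, True, True]

# dropping rows where everything is missing leaves only the first row
cleaned = events.dropna(how="all")
print(len(cleaned))           # 1
```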


#### <p style="font-family: monospace, monaco;"><strong>Duplicate Data</strong></p>
- <p style="font-family: monospace, monaco;">you can get rid of duplicate rows by using the <code>drop_duplicates</code> method</p>

In [77]:
df2.drop_duplicates(["Age"])

Properties,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Arlo,15.0,Wakefield,4.5,Arlington
2,Olive,16.0,Arlington Tech,,Arlington
4,,,,,


- <p style="font-family: monospace, monaco;">to check whether a column contains only unique values, or to list its unique values, use the following two commands (to run them on the index, use <code>df.index</code> instead of <code>df["School"]</code>)</p>

In [78]:
df["School"].is_unique

True

In [79]:
df["School"].unique()

array(['Wakefield', 'Arlington Tech', 'H-B Woodlawn'], dtype=object)

- <p style="font-family: monospace, monaco;">to understand which rows are duplicated use <code>.duplicated</code>, which returns a boolean Series</p>
- <p style="font-family: monospace, monaco;">by default, <code>duplicated</code> uses the parameter <code>keep="first"</code>, which leaves the first occurrence marked <code>False</code> and marks the later duplicates with <code>True</code></p>
- <p style="font-family: monospace, monaco;">if you use <code>keep=False</code>, it will mark every duplicate, including the first occurrence, as <code>True</code></p>
- <p style="font-family: monospace, monaco;">to check the index for duplicates use <code>df.index.duplicated()</code>; to check whole rows use <code>df.duplicated()</code></p>

In [80]:
df["Age"].duplicated()

user_id
1    False
2    False
3     True
Name: Age, dtype: bool

In [81]:
df.loc[df["Age"].duplicated(keep=False), :]

Properties,Name,Age,School,Score,County
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2,Olive,16,Arlington Tech,9.9,Arlington
3,Ally,16,H-B Woodlawn,8.7,Arlington
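
<p style="font-family: monospace, monaco;">Checking whole rows and the index for duplicates could look like the sketch below. The sample DataFrame has no duplicate rows or labels, so this uses a small hypothetical DataFrame <code>dupes</code> that repeats one row and one index label.</p>

```python
import pandas as pd

# hypothetical data: the last row repeats the first, and label 2 appears twice
dupes = pd.DataFrame(data=[["Arlo", 15], ["Olive", 16], ["Arlo", 15]],
                     columns=["Name", "Age"],
                     index=[1, 2, 2])

row_dupes = dupes.duplicated()          # compares whole rows
index_dupes = dupes.index.duplicated()  # compares index labels only

print(row_dupes.tolist())    # [False, False, True]
print(index_dupes.tolist())  # [False, False, True]
print(dupes.index.is_unique) # False

# drop_duplicates without arguments removes fully repeated rows
deduped = dupes.drop_duplicates()
print(deduped["Name"].tolist())  # ['Arlo', 'Olive']
```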


#### <p style="font-family: monospace, monaco;"><strong> Arithmetic Operations </strong></p>
- <p style="font-family: monospace, monaco;"> DataFrames make use of vectorization like numpy</p>

In [82]:
rainfall

Unnamed: 0,arlington,Ann Arbor,Cambridge
0,306.3,607.9,800.12
1,403.2,539.1,300.6


In [83]:
rainfall + 100

Unnamed: 0,arlington,Ann Arbor,Cambridge
0,406.3,707.9,900.12
1,503.2,639.1,400.6
