# 1 Install

`pip install pandas` in terminal

# 2 Import library

`import pandas as pd`

Import a CSV file with `df = pd.read_csv('path/file.csv')`

This returns a **dataframe**: rows and columns of data

**Several ways to glance at your dataframe**

`df.shape`

`df.info()` is a method, which is why it needs parentheses (`df.shape` is an attribute, so it has none)

*In the `df.info()` output, the `object` dtype usually means strings*

`df.head()` `df.tail()`

In a notebook, pandas displays at most 20 columns by default. This limit, and the limit on displayed rows, can be changed.

To change the number of displayed columns: `pd.set_option('display.max_columns',85)`

To change the number of displayed rows: `pd.set_option('display.max_rows',85)`
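A minimal sketch of setting the display options and reading them back. The value 85 is just the one used above, and resetting at the end is optional.

```python
import pandas as pd

# Raise the display limits for columns and rows.
pd.set_option('display.max_columns', 85)
pd.set_option('display.max_rows', 85)

# Read the current settings back to confirm the change.
max_cols = pd.get_option('display.max_columns')
max_rows = pd.get_option('display.max_rows')
print(max_cols, max_rows)

# Restore the library default once the wider view is no longer needed.
pd.reset_option('display.max_columns')
```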

# 3 Creating a dataframe from a Python dictionary

A dataframe can be created directly from a dictionary of lists by passing it to `pd.DataFrame()`.

Each key becomes a column name, and the corresponding list holds the values in that column.

In [58]:
import pandas as pd

people = {
    'first':['Jane', 'John', 'Jing','Amy', 'John'],
    'last':['Doe', 'Glassman', 'Murfey','Anderson', 'Doe'],
    'email':['janedoe@gmail.com', 'glassman@gmail.com', 'murfey@gmail.com','amy@gmail.com','johndoe@gmail.com']
}

In [2]:
# check the values in the first key
people['first']

#return a list

['Jane', 'John', 'Jing', 'Amy', 'John']

In [59]:
df_people=pd.DataFrame(people)
df_people #return a dataframe

Unnamed: 0,first,last,email
0,Jane,Doe,janedoe@gmail.com
1,John,Glassman,glassman@gmail.com
2,Jing,Murfey,murfey@gmail.com
3,Amy,Anderson,amy@gmail.com
4,John,Doe,johndoe@gmail.com


# 4 Selecting columns and rows in a dataframe

## 4.1 Select one column

In [None]:
# method 1
df_people['email']

In [None]:
type(df_people['email'])

This returns a Series instead of a list: a one-dimensional array of values with an index.

A dataframe is a container of multiple Series objects that share the same index.

In [None]:
#method 2, but not recommended.
df_people.email

## 4.2 Select multiple columns and rows

* multiple columns: `df_people[['first','email']]` - passing a list of column names returns a dataframe

* rows by *integer location*: `df.iloc[0]` - passing a position (0) returns the first row

* rows by *label*: `df.loc[]` - passing an index label returns the row with that label

* multiple rows: pass a list of indexes, `df.iloc[[0,1]]`, or a slice, `df.loc[0:2,'first':'email']`. With `loc`, a slice includes both endpoints.

In [None]:
df_people.iloc[[0,1]]

In [None]:
#select a specific column by its position
df_people.iloc[[0,1],1]

In [None]:
df_people.loc[[0,1],['last','email']]

In [None]:
df_people.loc[0:2,'first':'email']
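The difference between label slices and position slices is easy to miss. A small sketch, using a made-up four-row dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    'first': ['Jane', 'John', 'Jing', 'Amy'],
    'last': ['Doe', 'Glassman', 'Murfey', 'Anderson'],
})

by_label = df.loc[0:2]      # label slice: rows 0, 1 and 2 (end label included)
by_position = df.iloc[0:2]  # position slice: rows 0 and 1 (end position excluded)

print(len(by_label), len(by_position))
```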

Count how often each value occurs in a column: `df['Hobbyist'].value_counts()`

# 5 How to set, reset and use Indexes

1. Set a specific column as the index with `df.set_index('col_name')`. This returns a new dataframe and leaves the original unchanged. To change the original dataframe, add the argument `inplace=True`.

2. To go back to the default integer index, run `df.reset_index(inplace=True)`. The old index becomes a regular column again.

3. This setting can also be done while importing the dataframe. `df = pd.read_csv('path/file.csv', index_col='column_name')` 

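## When to use set_index()

1. for searching: rows can be looked up by their label with `df.loc['label']`

2. for sorting

`df.sort_index(ascending=False, inplace=True)` sorts by the index in descending order (reverse alphabetical for strings)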
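A short sketch of both uses, assuming a small made-up dataframe with unique email addresses:

```python
import pandas as pd

df = pd.DataFrame({
    'first': ['Jane', 'John', 'Amy'],
    'email': ['janedoe@gmail.com', 'glassman@gmail.com', 'amy@gmail.com'],
})

by_email = df.set_index('email')                     # returns a new dataframe
first_name = by_email.loc['amy@gmail.com', 'first']  # look up a row by its label
sorted_desc = by_email.sort_index(ascending=False)   # sort rows by index label

print(first_name)
print(list(sorted_desc.index))
```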

In [4]:
df1=df_people.set_index('email') #do not change the original df
#change the original df index
#df_people.set_index('email', inplace=True) 

In [5]:
df1

email,first,last
janedoe@gmail.com,Jane,Doe
glassman@gmail.com,John,Glassman
murfey@gmail.com,Jing,Murfey
amy@gmail.com,Amy,Anderson
johndoe@gmail.com,John,Doe


In [6]:
df1=df1.reset_index() #back to default setting

In [7]:
df1

Unnamed: 0,email,first,last
0,janedoe@gmail.com,Jane,Doe
1,glassman@gmail.com,John,Glassman
2,murfey@gmail.com,Jing,Murfey
3,amy@gmail.com,Amy,Anderson
4,johndoe@gmail.com,John,Doe


# 6 Filtering - Using conditionals to filter rows and columns

## 6.1 One condition


In [9]:
filt = (df_people['last'] == 'Doe')

#filt itself is a Series of boolean values

In [10]:
df_people[filt]

# same as df_people[df_people['last'] == 'Doe']

df_people.loc[filt, 'email']

0    janedoe@gmail.com
4    johndoe@gmail.com
Name: email, dtype: object

## 6.2 Multiple conditions 

Combine conditions with the element-wise operators `&` (and) and `|` (or), and wrap each condition in parentheses.

In [None]:
filt1 = ((df_people['last'] == 'Doe')&(df_people['first']=='Jane'))
df_people.loc[filt1]

In [12]:
filt2 = ((df_people['last'] == 'Doe')|(df_people['first']=='John'))
df_people.loc[filt2]

Unnamed: 0,first,last,email
0,Jane,Doe,janedoe@gmail.com
1,John,Glassman,glassman@gmail.com
4,John,Doe,johndoe@gmail.com


## 6.3 Return the opposite result

In [13]:
df_people.loc[~filt2] # ~filt2 negates the filter and returns the opposite result

Unnamed: 0,first,last,email
2,Jing,Murfey,murfey@gmail.com
3,Amy,Anderson,amy@gmail.com


## 6.4 Select values that are in a list

In [None]:
lastname=['Glassman','Doe','Anderson']
filt3=df_people['last'].isin(lastname)

In [None]:
df_people.loc[filt3, 'email']

## 6.5 Select values that contain a specific string

In [None]:
filt4=df_people['last'].str.contains('sm',na=False)
df_people.loc[filt4, 'email']
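The `na=False` argument matters when the column has missing values. A minimal sketch with a made-up Series, stored as `object` dtype to match the dataframes above:

```python
import numpy as np
import pandas as pd

last = pd.Series(['Glassman', 'Doe', np.nan], dtype=object)

# na=False treats missing values as "no match", so the mask holds only
# True/False and can be used directly as a filter.
mask = last.str.contains('sm', na=False)

print(mask.tolist())
print(last[mask].tolist())
```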

# 7 Updating Rows and Columns - Modifying Data within Dataframes

The notes are taken from this tutorial [video](https://www.youtube.com/watch?v=DCDe29sIKcE)

<p style="color:red">Do not make change on the original dataframe. Instead, create a new one by copy and make changes.</p>
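Plain assignment (`df1 = df`) does not create a copy; both names refer to the same dataframe. A small sketch of the difference, using a made-up two-row dataframe:

```python
import pandas as pd

original = pd.DataFrame({'first': ['Jane', 'John'], 'last': ['Doe', 'Glassman']})

alias = original               # same object under a second name
independent = original.copy()  # separate dataframe with its own data

independent.loc[0, 'last'] = 'Smith'  # does not touch `original`
alias.loc[1, 'last'] = 'Rogers'       # changes `original` as well

print(original)
print(alias is original)
```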

## 7.1 Updating columns

1. to replace the names of all columns

`df.columns = ['name1','name2'...]`

2. to change all the column names to uppercase/lowercase, or to replace spaces with underscores, using a list **comprehension**

`df.columns = [x.upper() for x in df.columns]`

3. to change the name of specific column(s)

`df.rename(columns={'old_name1':'new_name1', 'old_name2':'new_name2'}, inplace=True)`


In [16]:
df1 = df_people.copy() # work on a copy so that df_people keeps its column names
df1.columns = ['first name','last name','email']

In [17]:
# change the names to lowercase (use x.upper() for uppercase)
df1.columns = [x.lower() for x in df1.columns]
df1.columns

Index(['first name', 'last name', 'email'], dtype='object')

In [18]:
# replace underscores with spaces; swap the two arguments for the reverse
df1.columns = df1.columns.str.replace('_', ' ')
df1.columns

Index(['first name', 'last name', 'email'], dtype='object')

In [19]:
df1.rename(columns={'first name':'first', 'last name':'last'}, inplace=True)
df1

Unnamed: 0,first,last,email
0,Jane,Doe,janedoe@gmail.com
1,John,Glassman,glassman@gmail.com
2,Jing,Murfey,murfey@gmail.com
3,Amy,Anderson,amy@gmail.com
4,John,Doe,johndoe@gmail.com


## 7.2 Updating rows

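1. Updating a whole row by passing its label and a list of new values

`df.loc[row_label] = [values]`

2. Updating several values in a row by also passing the column names

`df.loc[row_label, ['col1','col2']] = [values]`

3. Updating the values in a column all to lowercase

`df['col']=df['col'].str.lower()` The result must be assigned back to the column, because `.str.lower()` returns a new Series.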

In [None]:
df2 = df_people
df2.loc[2] = ['Micheal','Smith','JohnSmith@gmail.com']

In [23]:
df2.loc[2,['last','email']] = ['Mike','JohnMike@gmail.com']
df2

Unnamed: 0,first,last,email
0,Jane,Doe,janedoe@gmail.com
1,John,Glassman,glassman@gmail.com
2,Micheal,Mike,JohnMike@gmail.com
3,Amy,Anderson,amy@gmail.com
4,John,Doe,johndoe@gmail.com


In [24]:
df2.loc[2,'last']='Smith'
#the same as df2.at[2,'last']='Smith' for changing a single value
df2

Unnamed: 0,first,last,email
0,Jane,Doe,janedoe@gmail.com
1,John,Glassman,glassman@gmail.com
2,Micheal,Smith,JohnMike@gmail.com
3,Amy,Anderson,amy@gmail.com
4,John,Doe,johndoe@gmail.com


**A common mistake: assigning through chained indexing.**



In [25]:
# this won't work: chained indexing assigns to a temporary copy
filt = (df2['last']=='Doe')
df2[filt]['last']= 'Smith'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


In [60]:
df2.loc[filt, 'last']= 'Smith'
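The working pattern in isolation, as a sketch with a made-up dataframe: the row filter and the column label go into a single `.loc` call.

```python
import pandas as pd

df = pd.DataFrame({
    'first': ['Jane', 'John', 'John'],
    'last': ['Doe', 'Glassman', 'Doe'],
})
mask = df['last'] == 'Doe'

# One .loc call selects rows and column together, so the assignment
# is applied to df itself rather than to a temporary copy.
df.loc[mask, 'last'] = 'Smith'

print(df['last'].tolist())
```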

In [61]:
# change all emails to lowercase
df_people['email'].str.lower() # returns a new Series; the dataframe itself is not changed

df3=df_people

df3['email'] =df3['email'].str.lower()
df3

Unnamed: 0,first,last,email
0,Jane,Doe,janedoe@gmail.com
1,John,Glassman,glassman@gmail.com
2,Jing,Murfey,murfey@gmail.com
3,Amy,Anderson,amy@gmail.com
4,John,Doe,johndoe@gmail.com


## 7.3 apply() 

### 7.3.1 apply() to series

`df['col'].apply(func)`

In [31]:
#built-in function
df3['email'].apply(len)

0    17
1    18
2    18
3    13
4    17
Name: email, dtype: int64

In [32]:
#defined function

def upper_email(email):
    return email.upper()

df3['email'].apply(upper_email)

0     JANEDOE@GMAIL.COM
1    GLASSMAN@GMAIL.COM
2    JOHNMIKE@GMAIL.COM
3         AMY@GMAIL.COM
4     JOHNDOE@GMAIL.COM
Name: email, dtype: object

In [34]:
df3['email']=df3['email'].apply(upper_email)
df3

Unnamed: 0,first,last,email
0,Jane,Smith,JANEDOE@GMAIL.COM
1,John,Glassman,GLASSMAN@GMAIL.COM
2,Micheal,Smith,JOHNMIKE@GMAIL.COM
3,Amy,Anderson,AMY@GMAIL.COM
4,John,Smith,JOHNDOE@GMAIL.COM


In [36]:
df3['email']=df3['email'].apply(lambda x: x.lower())
df3

Unnamed: 0,first,last,email
0,Jane,Smith,janedoe@gmail.com
1,John,Glassman,glassman@gmail.com
2,Micheal,Smith,johnmike@gmail.com
3,Amy,Anderson,amy@gmail.com
4,John,Smith,johndoe@gmail.com


### 7.3.2 apply() to dataframe

In [37]:
df3.apply(len) # returns the length of each column (number of rows), not of each cell

first    5
last     5
email    5
dtype: int64

In [38]:
df3.apply(len,axis=1)

0    3
1    3
2    3
3    3
4    3
dtype: int64

In [40]:
df3.apply(pd.Series.min)

first              Amy
last          Anderson
email    amy@gmail.com
dtype: object

In [41]:
df3.apply(lambda x: x.min())

first              Amy
last          Anderson
email    amy@gmail.com
dtype: object

## 7.4 applymap() for dataframe

In [42]:
df3.applymap(len)

Unnamed: 0,first,last,email
0,4,5,17
1,4,8,18
2,7,5,18
3,3,8,13
4,4,5,17


In [49]:
df3.applymap(str.upper)
df3.applymap(str.lower)
df3 # the original dataframe was not changed!

Unnamed: 0,first,last,email
0,Jane,Smith,janedoe@gmail.com
1,John,Glassman,glassman@gmail.com
2,Micheal,Smith,johnmike@gmail.com
3,Amy,Anderson,amy@gmail.com
4,John,Smith,johndoe@gmail.com
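In pandas 2.1 and later, `applymap()` is deprecated in favour of `DataFrame.map()`, which does the same element-wise job. A version-tolerant sketch, using a made-up dataframe; the `elementwise` name is only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'first': ['Jane', 'Amy'], 'last': ['Doe', 'Anderson']})

# Pick whichever element-wise method this pandas version provides.
elementwise = df.map if hasattr(df, 'map') else df.applymap

lengths = elementwise(len)      # length of every cell
upper = elementwise(str.upper)  # returns a new dataframe; df is unchanged

print(lengths)
print(upper)
```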


## 7.5 map() and replace() for series

* map() substitutes values in a series; any value without a mapping becomes NaN

* replace() substitutes only the listed values and keeps all other values unchanged

In [52]:
df3['first'].map({'Jane':'Mary','John':'Chris','Amy':'Emily'})

0     Mary
1    Chris
2      NaN
3    Emily
4    Chris
Name: first, dtype: object

In [53]:
df3['first'].replace({'Jane':'Mary','John':'Chris','Amy':'Emily'})

0       Mary
1      Chris
2    Micheal
3      Emily
4      Chris
Name: first, dtype: object

In [55]:
df3['first'] = df3['first'].replace({'Jane':'Mary','John':'Chris','Amy':'Emily'})
df3

Unnamed: 0,first,last,email
0,Mary,Smith,janedoe@gmail.com
1,Chris,Glassman,glassman@gmail.com
2,Micheal,Smith,johnmike@gmail.com
3,Emily,Anderson,amy@gmail.com
4,Chris,Smith,johndoe@gmail.com


# 8 Add/Remove Rows and Columns

`df['new_col'] = [...]` (a list or Series with one value per row)

In [74]:
df_people

df5=df_people

In [75]:
df5['first']+' '+df5['last']

0         Jane Doe
1    John Glassman
2      Jing Murfey
3     Amy Anderson
4         John Doe
dtype: object

## 8.1 Add a new column

In [76]:
df5['full_name']=df5['first']+' '+df5['last']

In [77]:
df5

Unnamed: 0,first,last,email,full_name
0,Jane,Doe,janedoe@gmail.com,Jane Doe
1,John,Glassman,glassman@gmail.com,John Glassman
2,Jing,Murfey,murfey@gmail.com,Jing Murfey
3,Amy,Anderson,amy@gmail.com,Amy Anderson
4,John,Doe,johndoe@gmail.com,John Doe


## 8.2 Remove columns

`df.drop(columns=['col1','col2'])`

In [None]:
df6=df5
df6.drop(columns=['first','last'], inplace = True)

In [83]:
df6

Unnamed: 0,email,full_name
0,janedoe@gmail.com,Jane Doe
1,glassman@gmail.com,John Glassman
2,murfey@gmail.com,Jing Murfey
3,amy@gmail.com,Amy Anderson
4,johndoe@gmail.com,John Doe


In [86]:
df6[['first','last']]=df6['full_name'].str.split(' ', expand=True)
df6

Unnamed: 0,email,full_name,first,last
0,janedoe@gmail.com,Jane Doe,Jane,Doe
1,glassman@gmail.com,John Glassman,John,Glassman
2,murfey@gmail.com,Jing Murfey,Jing,Murfey
3,amy@gmail.com,Amy Anderson,Amy,Anderson
4,johndoe@gmail.com,John Doe,John,Doe


## 8.3 Add a single row 

`df.append()` (deprecated in pandas 1.4 and removed in 2.0; newer versions use `pd.concat()` instead)

In [88]:
df6.append({'first':'Tony','last': 'Theguy'}, ignore_index=True)

Unnamed: 0,email,full_name,first,last
0,janedoe@gmail.com,Jane Doe,Jane,Doe
1,glassman@gmail.com,John Glassman,John,Glassman
2,murfey@gmail.com,Jing Murfey,Jing,Murfey
3,amy@gmail.com,Amy Anderson,Amy,Anderson
4,johndoe@gmail.com,John Doe,John,Doe
5,,,Tony,Theguy


In [90]:
people2 = {
    'first':['Tony','Steve'],
    'last':['Stark','Rogers'],
    'email':['IronMan@avenge.com','Cap@avenge.com']
}
df_people2=pd.DataFrame(people2)

**Combine two dataframes**

In [93]:
df6=df6.append(df_people2,ignore_index=True) # append() has no inplace argument, so reassign the result
df6

Unnamed: 0,email,full_name,first,last
0,janedoe@gmail.com,Jane Doe,Jane,Doe
1,glassman@gmail.com,John Glassman,John,Glassman
2,murfey@gmail.com,Jing Murfey,Jing,Murfey
3,amy@gmail.com,Amy Anderson,Amy,Anderson
4,johndoe@gmail.com,John Doe,John,Doe
5,,,,Steve
6,IronMan@avenge.com,,Tony,Stark
7,Cap@avenge.com,,Steve,Rogers
8,IronMan@avenge.com,,Tony,Stark
9,Cap@avenge.com,,Steve,Rogers
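Since `DataFrame.append()` no longer exists in pandas 2.0 and later, the same two operations can be sketched with `pd.concat()`. The small dataframes here are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'first': ['Jane'], 'last': ['Doe'], 'email': ['janedoe@gmail.com']})

# Add a single row: wrap the dict in a one-row dataframe first.
new_row = pd.DataFrame([{'first': 'Tony', 'last': 'Theguy'}])
with_row = pd.concat([df, new_row], ignore_index=True)

# Combine two dataframes; columns are aligned by name.
more_people = pd.DataFrame({
    'first': ['Tony', 'Steve'],
    'last': ['Stark', 'Rogers'],
    'email': ['IronMan@avenge.com', 'Cap@avenge.com'],
})
combined = pd.concat([df, more_people], ignore_index=True)

print(with_row)
print(combined)
```

As with `append()`, `pd.concat()` returns a new dataframe, so the result has to be assigned to a name.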


## 8.4 Remove rows

`df.drop(index=4)`

In [98]:
df6.drop(index=df6[df6['first']=='Steve'].index)

Unnamed: 0,email,full_name,first,last
0,janedoe@gmail.com,Jane Doe,Jane,Doe
1,glassman@gmail.com,John Glassman,John,Glassman
2,murfey@gmail.com,Jing Murfey,Jing,Murfey
3,amy@gmail.com,Amy Anderson,Amy,Anderson
4,johndoe@gmail.com,John Doe,John,Doe
5,,,,Steve
6,IronMan@avenge.com,,Tony,Stark
8,IronMan@avenge.com,,Tony,Stark


# 9 Sorting Data 

1. `df.sort_values(by='col', ascending=False)`

2. to change it back: `df.sort_index()`

3. to sort values in a column: `df['col'].sort_values()`

4. to select the largest 10 values: `df['col'].nlargest(10)` or `df.nlargest(10,'col')`

5. to select the smallest 10 values: `df['col'].nsmallest(10)` or `df.nsmallest(10,'col')`
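The calls listed above, sketched on a made-up dataframe with a numeric `score` column:

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'], 'score': [10, 40, 30, 20]})

by_score = df.sort_values(by='score', ascending=False)  # whole dataframe, highest first
restored = by_score.sort_index()                        # back to the original row order
top2 = df.nlargest(2, 'score')                          # the two rows with the highest score
low2 = df['score'].nsmallest(2)                         # the two lowest values as a Series

print(by_score)
print(top2)
```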

In [99]:
df9=df_people

In [101]:
df9.sort_values(by='last', ascending=False)

Unnamed: 0,email,full_name,first,last
5,,,,Steve
2,murfey@gmail.com,Jing Murfey,Jing,Murfey
1,glassman@gmail.com,John Glassman,John,Glassman
0,janedoe@gmail.com,Jane Doe,Jane,Doe
4,johndoe@gmail.com,John Doe,John,Doe
3,amy@gmail.com,Amy Anderson,Amy,Anderson


In [102]:
df9.sort_values(by=['last','first'], ascending=False)

Unnamed: 0,email,full_name,first,last
5,,,,Steve
2,murfey@gmail.com,Jing Murfey,Jing,Murfey
1,glassman@gmail.com,John Glassman,John,Glassman
4,johndoe@gmail.com,John Doe,John,Doe
0,janedoe@gmail.com,Jane Doe,Jane,Doe
3,amy@gmail.com,Amy Anderson,Amy,Anderson


In [104]:
df9.sort_values(by=['last','first'], ascending=[False,True], inplace=True)

In [105]:
df9.sort_index()

Unnamed: 0,email,full_name,first,last
0,janedoe@gmail.com,Jane Doe,Jane,Doe
1,glassman@gmail.com,John Glassman,John,Glassman
2,murfey@gmail.com,Jing Murfey,Jing,Murfey
3,amy@gmail.com,Amy Anderson,Amy,Anderson
4,johndoe@gmail.com,John Doe,John,Doe
5,,,,Steve


In [106]:
df9['email'].sort_values()

3         amy@gmail.com
1    glassman@gmail.com
0     janedoe@gmail.com
4     johndoe@gmail.com
2      murfey@gmail.com
5                   NaN
Name: email, dtype: object

# 10 Grouping and Aggregating

The notes are taken following this tutorial [video](https://www.youtube.com/watch?v=txMdrV1Ut64)


## 10.1 Basic aggregation

* Aggregating example: `df['col'].median()` returns the median, ignoring NaN values

    `df.median()` returns the median of every numeric column (in pandas 2.0 and later, pass `numeric_only=True` if the dataframe also has non-numeric columns)

* For descriptive statistics of the whole dataframe: `df.describe()`

    returns count (non-missing rows), mean, std, min, max, etc.


* `df['col'].value_counts()` can count each response 

    normalization: `df['col'].value_counts(normalize=True)`
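A small sketch of these aggregations. The data is made up, and the column names `Hobbyist` and `Salary` are borrowed from the examples in these notes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Hobbyist': ['Yes', 'No', 'Yes', 'Yes'],
    'Salary': [50000.0, 70000.0, np.nan, 60000.0],
})

salary_median = df['Salary'].median()                 # NaN is skipped
numeric_medians = df.median(numeric_only=True)        # numeric columns only
counts = df['Hobbyist'].value_counts()                # absolute counts
shares = df['Hobbyist'].value_counts(normalize=True)  # proportions that sum to 1

print(salary_median)
print(counts)
print(shares)
```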

## 10.2 Grouping

`df.groupby(['col'])` returns a DataFrameGroupBy object, which can be assigned to a variable and then queried. For example,

<code>country_grp = df.groupby('country')</code>

<br>
<code>country_grp.get_group('United States')</code>

It is the same as a filter: `df.loc[df['country']=='United States']`
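A minimal sketch comparing the two approaches on made-up survey rows. Grouping by the plain string `'country'` is used here so that `get_group()` accepts a plain string key:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['United States', 'India', 'United States', 'Germany'],
    'SocialMedia': ['Reddit', 'WhatsApp', 'Twitter', 'Reddit'],
})

country_grp = df.groupby('country')

from_group = country_grp.get_group('United States')      # rows of one group
from_filter = df.loc[df['country'] == 'United States']   # same rows via a filter

print(from_group['SocialMedia'].tolist())
print(from_filter['SocialMedia'].tolist())
```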


If we want to look at the popularity of each social media platform by country

* `df.loc[df['country']=='United States']['SocialMedia'].value_counts()` covers only one country

* `country_grp['SocialMedia'].value_counts()` returns the counts for all countries. The result has two index levels (country and platform).

    We can still look at a specific country with `.loc[]`

    `country_grp['SocialMedia'].value_counts().loc['India']`

### Groupby combined with aggregation functions

* Looking at the median salary in each country: `country_grp['Salary'].median()`
    For a specific country: `country_grp['Salary'].median().loc['Germany']`
    
    
* **Multiple aggregation functions**: `country_grp['Salary'].agg(['median','mean']).loc['Germany']`
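A short sketch of grouped aggregation with made-up salaries:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Germany', 'Germany', 'India', 'India', 'India'],
    'Salary': [60000, 80000, 10000, 20000, 60000],
})

country_grp = df.groupby('country')

medians = country_grp['Salary'].median()               # one value per country
stats = country_grp['Salary'].agg(['median', 'mean'])  # one column per function
germany = stats.loc['Germany']                         # both statistics for one country

print(stats)
```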

### apply() for groupby objects

* If we want to count how many respondents use Python as one of their working languages in a specific country: `df.loc[df['country']=='India']['LanguageWorkWith'].str.contains('Python').sum()`

    `.sum()` works not only on numeric values but also on booleans (True = 1, False = 0)

* If we want to count how many respondents use Python as one of their working languages in each country: `country_grp['LanguageWorkWith'].apply(lambda x: x.str.contains('Python').sum())`

 1. `country_grp['LanguageWorkWith'].str.contains('Python').sum()` does not work because `.str.contains()` only works on Series objects, while `country_grp['LanguageWorkWith']` returns a SeriesGroupBy object. Therefore, we need to use `.apply()`.

 2. In `lambda x`, `x` is the Series of values for one group.
 
    
* What percentage of people from each country know Python?

 1. Calculate the total number of respondents in each country: `country_respondents = df['country'].value_counts()`

 2. Calculate the number of respondents who use Python in each country: `country_uses_python=country_grp['LanguageWorkWith'].apply(lambda x: x.str.contains('Python').sum())`

 3. Combine the two results into one new dataframe: `python_df=pd.concat([country_respondents, country_uses_python],axis=1,sort=False)`

 4. Rename the columns so that they are meaningful: `python_df.rename(columns={'country':'NumResp','LanguageWorkWith':'NumPython'}, inplace=True)` (in pandas 2.0 and later the `value_counts()` column is named `count` instead of `country`)

 5. Calculate the percentage: `python_df['PctKnowsPython'] = python_df['NumPython'] / python_df['NumResp'] * 100`

 6. Sort the result: `python_df.sort_values(by='PctKnowsPython', ascending=False, inplace=True)`
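The steps above can be sketched end to end on made-up data. As an assumption for illustration, each Series is named explicitly with `.rename()` before combining, so the column names do not depend on the pandas version, and `na=False` is added so that missing answers count as "no Python":

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['India', 'India', 'Germany', 'Germany', 'Germany'],
    'LanguageWorkWith': ['Python;SQL', 'Java', 'Python', 'C++;Python', None],
})

country_grp = df.groupby('country')

# Steps 1 and 2: respondents per country and Python users per country.
num_resp = df['country'].value_counts().rename('NumResp')
num_python = (
    country_grp['LanguageWorkWith']
    .apply(lambda x: x.str.contains('Python', na=False).sum())
    .rename('NumPython')
)

# Steps 3 to 6: align on the country index, compute the share, sort.
python_df = pd.concat([num_resp, num_python], axis=1)
python_df['PctKnowsPython'] = python_df['NumPython'] / python_df['NumResp'] * 100
python_df = python_df.sort_values(by='PctKnowsPython', ascending=False)

print(python_df)
```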
 



# 11 Cleaning Data

## 11.1 Dealing with missing values

### Drop NaN values

**Check whether each value is NaN**: `df.isna()`

`df.dropna()` is the same as `df.dropna(axis='index', how='any')`

The arguments:

* `axis='index'` drops rows, and `axis='columns'` drops columns

* `how` is the criterion: `how='any'` drops a row/column if it contains any NaN value, and `how='all'` drops it only if all of its values are NaN.


### Drop the rows if missing values in a specific column

For example, first name and last name may be missing, but every row must have an email address. Rows without one are dropped.

`df.dropna(axis='index', how='any', subset=['email'])`

### Fill the missing values with desired value

`df.fillna(0)` is useful for numerical data, e.g., to set all missing values to zero. Note that `df.fillna('0')` would insert the string `'0'` instead of the number.
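A small sketch of filling missing values, using a made-up dataframe. Passing a dict to `fillna()` sets a different fill value per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'first': ['Jane', np.nan], 'age': [35.0, np.nan]})

# A numeric placeholder keeps the column numeric.
filled_age = df['age'].fillna(0)

# A dict maps column names to their own fill values.
filled_all = df.fillna({'first': 'unknown', 'age': 0})

print(filled_age.tolist())
print(filled_all)
```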

In [120]:
import numpy as np

people_demo = {
    'first':['Jane', 'John', 'Jing','Amy', np.nan, None, 'NA'],
    'last':['Doe', 'Glassman', 'Murfey','Anderson', np.nan, np.nan, 'Missing'],
    'email':['janedoe@gmail.com', 'glassman@gmail.com', 'murfey@gmail.com','amy@gmail.com',None, np.nan,'johndoe@gmail.com'],
    'age':['35', '33', '31','36', None, '29','Missing']
}

In [121]:
df_demo = pd.DataFrame(people_demo)
df_demo

Unnamed: 0,first,last,email,age
0,Jane,Doe,janedoe@gmail.com,35
1,John,Glassman,glassman@gmail.com,33
2,Jing,Murfey,murfey@gmail.com,31
3,Amy,Anderson,amy@gmail.com,36
4,,,,
5,,,,29
6,,Missing,johndoe@gmail.com,Missing


In [122]:
df_demo.dropna()

Unnamed: 0,first,last,email,age
0,Jane,Doe,janedoe@gmail.com,35
1,John,Glassman,glassman@gmail.com,33
2,Jing,Murfey,murfey@gmail.com,31
3,Amy,Anderson,amy@gmail.com,36
6,,Missing,johndoe@gmail.com,Missing


In [123]:
df_demo.dropna(how='all')

Unnamed: 0,first,last,email,age
0,Jane,Doe,janedoe@gmail.com,35
1,John,Glassman,glassman@gmail.com,33
2,Jing,Murfey,murfey@gmail.com,31
3,Amy,Anderson,amy@gmail.com,36
5,,,,29
6,,Missing,johndoe@gmail.com,Missing


In [129]:
df_demo.dropna(axis='index', how='all', subset=['email','last'])

Unnamed: 0,first,last,email,age
0,Jane,Doe,janedoe@gmail.com,35
1,John,Glassman,glassman@gmail.com,33
2,Jing,Murfey,murfey@gmail.com,31
3,Amy,Anderson,amy@gmail.com,36
6,,Missing,johndoe@gmail.com,Missing


In [130]:
df_demo.replace('NA',np.nan, inplace=True)
df_demo.replace('Missing',np.nan, inplace=True)

In [131]:
df_demo

Unnamed: 0,first,last,email,age
0,Jane,Doe,janedoe@gmail.com,35.0
1,John,Glassman,glassman@gmail.com,33.0
2,Jing,Murfey,murfey@gmail.com,31.0
3,Amy,Anderson,amy@gmail.com,36.0
4,,,,
5,,,,29.0
6,,,johndoe@gmail.com,


In [132]:
df_demo.dropna(axis='index', how='all', subset=['email','last'])

Unnamed: 0,first,last,email,age
0,Jane,Doe,janedoe@gmail.com,35.0
1,John,Glassman,glassman@gmail.com,33.0
2,Jing,Murfey,murfey@gmail.com,31.0
3,Amy,Anderson,amy@gmail.com,36.0
6,,,johndoe@gmail.com,


## 11.2 Casting columns with missing values

The `'age'` column has dtype `object` (strings), which does not support math, so we need to convert the data type.

`df['age'].astype(int)` won't work because the column contains `np.nan`, which is a float (check with `type(np.nan)`).

Instead, we can convert the values under 'age' to float: `df['age'].astype(float)`
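An alternative sketch, not taken from the tutorial: `pd.to_numeric()` with `errors='coerce'` converts strings to numbers and turns anything unparseable into NaN. The last step assumes a pandas version that has the nullable `'Int64'` dtype (0.24 or later).

```python
import pandas as pd

age = pd.Series(['35', '33', None, 'Missing'], dtype=object)

# Unparseable entries such as 'Missing' become NaN instead of raising.
age_float = pd.to_numeric(age, errors='coerce')
mean_age = age_float.mean()  # NaN values are skipped

# Assumption: nullable integer dtype, which can hold integers and missing values.
age_int = age_float.astype('Int64')

print(mean_age)
print(age_int)
```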

In [133]:
df_demo.dtypes

first    object
last     object
email    object
age      object
dtype: object

In [136]:
df_demo['age']=df_demo['age'].astype(float)

In [138]:
df_demo.dtypes

first     object
last      object
email     object
age      float64
dtype: object

In [141]:
df_demo['age'].mean()

32.800000000000004

## 11.3 Specify custom missing-value markers while importing a file

`df=pd.read_csv('path/filename.csv', index_col='Respondent',na_values=['NA','Missing'])`

To check all unique values: `df['YearsCode'].unique()`
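A self-contained sketch of `na_values`, reading from an inline string instead of a file. The CSV content is made up for the example:

```python
import io
import pandas as pd

# Inline stand-in for a CSV file; the rows are illustrative only.
csv_text = """Respondent,YearsCode,Country
1,4,India
2,Missing,Germany
3,NA,India
"""

df = pd.read_csv(io.StringIO(csv_text), index_col='Respondent',
                 na_values=['NA', 'Missing'])

# Both markers were read as NaN, so the column is numeric.
print(df['YearsCode'].unique())
print(df['YearsCode'].isna().sum())
```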