![](https://miro.medium.com/max/2460/1*LRD_hq6lX-D_BM4RJcRK8w.png)

## Questions to ponder when Pre-processing/Cleaning the Data
* Do I have the right index?
* Do we have missing values? Or null values? 
* Do we need to change any of our dtypes? 
* How about the column names? 
* Should we drop any columns? 
* Are there duplicates? 
* Check for outliers. 
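
The last two questions can be checked in a few lines. This is a minimal sketch on hypothetical toy data (the `toy` frame and its values are made up), using the 1.5 × IQR rule, which is one common convention for flagging outliers and not the only one.

```python
import pandas as pd

# Hypothetical toy data standing in for the drinks table
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "E"],
    "beer_servings": [10, 12, 11, 13, 400, 400],
})

# Count fully duplicated rows, then drop them
n_duplicates = int(toy.duplicated().sum())
toy = toy.drop_duplicates()

# Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
q1 = toy["beer_servings"].quantile(0.25)
q3 = toy["beer_servings"].quantile(0.75)
iqr = q3 - q1
is_outlier = (toy["beer_servings"] < q1 - 1.5 * iqr) | (toy["beer_servings"] > q3 + 1.5 * iqr)
outliers = toy.loc[is_outlier, "country"].tolist()

print(n_duplicates, outliers)
```

Whether a flagged value is an error or a real extreme is a judgment call, so it helps to inspect the rows before dropping anything.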

In [1]:
import pandas as pd
import numpy as np 
df = pd.read_csv('http://bit.ly/drinksbycountry')
df.head()

,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa


In [2]:
#Ignore me for now
df.replace(0, np.nan, inplace=True)

In [3]:
df.head()

,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,,,,,Asia
1,Albania,89.0,132.0,54.0,4.9,Europe
2,Algeria,25.0,,14.0,0.7,Africa
3,Andorra,245.0,138.0,312.0,12.4,Europe
4,Angola,217.0,57.0,45.0,5.9,Africa


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   178 non-null float64
spirit_servings                 170 non-null float64
wine_servings                   162 non-null float64
total_litres_of_pure_alcohol    180 non-null float64
continent                       193 non-null object
dtypes: float64(4), object(2)
memory usage: 9.1+ KB


## Deal with null values 
- How many are there? 
- Should we delete them? Replace them? What should we replace them with? Mean or median or mode or zero?
- Maybe we should get rid of the entire column. 
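
The questions above can be explored with a short sketch. The `toy` frame and its values are hypothetical. The sketch reports missing values as a share of rows as well as a count, then fills a numeric column with its median and a categorical column with its mode.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "beer_servings": [10.0, np.nan, 30.0, 200.0],
    "continent": ["Asia", "Asia", None, "Europe"],
})

# How many values are missing, as a count and as a share of rows
missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()

# Numeric column: the median is less sensitive to extreme values than the mean
filled_median = toy["beer_servings"].fillna(toy["beer_servings"].median())

# Categorical column: fill with the most frequent value (the mode)
filled_mode = toy["continent"].fillna(toy["continent"].mode()[0])

print(missing_share)
```

Which fill value is appropriate depends on what a missing entry means in the data. Here the nulls were created from zeros, so zero is the natural replacement.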

In [5]:
#isnull - how can we make this more helpful?
print(df.isnull().sum())

country                          0
beer_servings                   15
spirit_servings                 23
wine_servings                   31
total_litres_of_pure_alcohol    13
continent                        0
dtype: int64


In [6]:
# Let's view the rows with null values to gain more context
df[df.isna().any(axis=1)]

,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,,,,,Asia
2,Algeria,25.0,,14.0,0.7,Africa
13,Bangladesh,,,,,Asia
19,Bhutan,23.0,,,0.4,Asia
27,Burundi,88.0,,,6.3,Africa
40,Cook Islands,,254.0,74.0,5.9,Oceania
46,North Korea,,,,,Asia
55,Equatorial Guinea,92.0,,233.0,5.8,Africa
56,Eritrea,18.0,,,0.5,Africa
58,Ethiopia,20.0,3.0,,0.7,Africa


_A DataFrame has two axes: axis 0 runs down the rows and axis 1 runs across the columns. In `df.isna().any(axis=1)` above, the check runs across the columns, giving one True/False value per row._
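
A small sketch of how the `axis` argument changes a reduction, using a hypothetical three-row frame (names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"beer": [1.0, np.nan, 3.0], "wine": [np.nan, np.nan, 6.0]},
    index=["A", "B", "C"],
)

# axis=0 (the default) collapses the rows: one result per column
nulls_per_column = toy.isna().sum(axis=0)

# axis=1 collapses the columns: one result per row
nulls_per_row = toy.isna().sum(axis=1)
row_has_null = toy.isna().any(axis=1)

print(nulls_per_column)
print(nulls_per_row)
```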

### A couple of ways to fill in values in an entire df

In [7]:
df.fillna(0, inplace=True)

In [8]:
print(df.isnull().sum())

country                         0
beer_servings                   0
spirit_servings                 0
wine_servings                   0
total_litres_of_pure_alcohol    0
continent                       0
dtype: int64


In [9]:
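# Equivalent to fillna(0):
#df.replace(np.nan, 0, inplace=True)
# The reverse for a single column, turning zeros into missing values:
#df['column_name'] = df['column_name'].replace(0, np.nan)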

## Changing the Index 
An index is like an address: it is how any data point in a DataFrame or Series is accessed. Rows and columns both have indexes. The row labels are called the index, and the column labels are the column names.
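
To make the address idea concrete, here is a minimal sketch on a hypothetical three-row frame. It checks that a column is unique before promoting it to the index, then looks values up by label with `.loc` and by position with `.iloc`.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra"],
    "beer_servings": [89, 25, 245],
})

# With the default RangeIndex, rows are labelled 0, 1, 2
first_country = toy.loc[0, "country"]

# A good index is unique, so check before using the column as one
country_is_unique = toy["country"].is_unique

indexed = toy.set_index("country")

# Label-based lookup with .loc, position-based lookup with .iloc
beer_andorra = indexed.loc["Andorra", "beer_servings"]
beer_first_row = indexed.iloc[0, 0]

print(beer_andorra, beer_first_row)
```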

In [10]:
df['country'].nunique() 

193

In [11]:
df.drop_duplicates()

,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0.0,0.0,0.0,0.0,Asia
1,Albania,89.0,132.0,54.0,4.9,Europe
2,Algeria,25.0,0.0,14.0,0.7,Africa
3,Andorra,245.0,138.0,312.0,12.4,Europe
4,Angola,217.0,57.0,45.0,5.9,Africa
5,Antigua & Barbuda,102.0,128.0,45.0,4.9,North America
6,Argentina,193.0,25.0,221.0,8.3,South America
7,Armenia,21.0,179.0,11.0,3.8,Europe
8,Australia,261.0,72.0,212.0,10.4,Oceania
9,Austria,279.0,75.0,191.0,9.7,Europe


In [12]:
df.set_index('country', inplace=True) #another way to do this - add index_col to read_csv()

In [13]:
df.head()

country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
Afghanistan,0.0,0.0,0.0,0.0,Asia
Albania,89.0,132.0,54.0,4.9,Europe
Algeria,25.0,0.0,14.0,0.7,Africa
Andorra,245.0,138.0,312.0,12.4,Europe
Angola,217.0,57.0,45.0,5.9,Africa


In [14]:
# sort index based on ascending order
df['continent'].value_counts().sort_index()

Africa           53
Asia             44
Europe           45
North America    23
Oceania          16
South America    12
Name: continent, dtype: int64

In [15]:
df.dtypes

beer_servings                   float64
spirit_servings                 float64
wine_servings                   float64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [16]:
#df['column_name'].astype('int')
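
The serving columns became `float64` once they held NaN. A hedged sketch of two ways back to integers, on a hypothetical one-column frame: fill the gaps first and cast, or use the nullable `Int64` dtype, which can hold missing values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"beer_servings": [89.0, 25.0, np.nan]})

# astype('int') raises on NaN, so fill (or drop) missing values first
as_int = toy["beer_servings"].fillna(0).astype("int")

# Alternative: the nullable integer dtype "Int64" keeps missing values as <NA>
as_nullable = toy["beer_servings"].astype("Int64")

print(as_int.dtype, as_nullable.dtype)
```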

## Dropping columns and rows 
[Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
- Pay attention to the index
- _Notice:_ Without `inplace=True`, `drop()` returns a new DataFrame and leaves the original unchanged, so assign the result back (`df = df.drop(...)`) to keep it.
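
A short sketch of both points, using a hypothetical two-row frame indexed by country: rows are dropped by index label, and the original frame changes only when the result is assigned back.

```python
import pandas as pd

toy = pd.DataFrame(
    {"beer": [89, 25], "wine": [54, 14], "continent": ["Europe", "Africa"]},
    index=["Albania", "Algeria"],
)

# drop() returns a new DataFrame; toy itself is unchanged here
without_wine = toy.drop(columns=["wine"])
columns_after_first_drop = list(toy.columns)

# Rows are dropped by index label
without_algeria = toy.drop(index=["Algeria"])

# To keep the result, assign it back
toy = toy.drop(columns=["continent"])

print(columns_after_first_drop, list(toy.columns))
```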


In [17]:
#df.drop(['column_name', 'column_name'], axis=1)
#df.drop(columns=['column_name', 'column_name'])

#df.drop(index=['row_label'])  # rows are dropped by index label

## Lambda Functions 
A lambda function is a small anonymous function. It is often called a throwaway function because it is usually defined inline, used once, and never given a name.
A lambda function can take any number of arguments, but can only have one expression.

_lambda arguments : expression_
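
A brief sketch of the syntax in use. The function and list names are made up for illustration. It shows a lambda with two arguments next to its named equivalent, then lambdas passed inline as a sort key and to `map()`.

```python
# A named function and the equivalent two-argument lambda
def add(a, b):
    return a + b

add_lambda = lambda a, b: a + b

# Lambdas are most useful inline, for example as a sort key
columns = ["Wine Servings", "beer servings", "Continent"]
sorted_columns = sorted(columns, key=lambda name: name.lower())

# Or as the function handed to map()
cleaned = list(map(lambda name: name.replace(" ", "_").lower(), columns))

print(add_lambda(2, 3), sorted_columns, cleaned)
```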



In [18]:
def my_func(x): 
    return x

lambda x: x

<function __main__.<lambda>(x)>

In [19]:
# Rename columns
df = df.rename(columns = lambda x: x.replace(" ", "_").lower())

## Groupby Function 
Used to split the data into groups based on some criteria.

> Table_name.groupby(['Group'])['Feature'].aggregation()

**Table_name:** the name of the DataFrame that holds the data you are working on.
<br>
**groupby:** splits the rows into groups that share the same value in one or more columns. It does not sort or change the data itself. In this case, the grouping column is Group.
<br>
**Feature:** the column (or columns) you want to include in the computation.<br>
**aggregation():** the aggregation function to run on each group, for example:
- mean(): Compute mean of groups
- sum(): Compute sum of group values
- size(): Compute group sizes
- count(): Compute count of non-null values in each group
- std(): Standard deviation of groups
- var(): Compute variance of groups
- sem(): Standard error of the mean of groups
- describe(): Generates descriptive statistics
- first(): Compute first of group values
- last(): Compute last of group values
- nth(): Take nth value, or a subset if n is a list
- min(): Compute min of group values
- max(): Compute max of group values
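
The pattern above can be sketched on a hypothetical five-row frame (values are made up). It also shows the difference between `size()`, which counts rows, and `count()`, which skips missing values, and uses `agg()` to run several aggregations at once.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia", "Asia"],
    "beer_servings": [100.0, 200.0, 10.0, 20.0, np.nan],
})

grouped = toy.groupby("continent")["beer_servings"]

# One aggregation per call
mean_beer = grouped.mean()

# size() counts every row in the group, count() only the non-null values
sizes = grouped.size()
counts = grouped.count()

# Several aggregations at once with agg()
summary = grouped.agg(["count", "mean", "max"])

print(summary)
```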

In [20]:
# What's the average beer_servings per continent?
df.groupby('continent').beer_servings.mean().round()

continent
Africa            61.0
Asia              37.0
Europe           194.0
North America    145.0
Oceania           90.0
South America    175.0
Name: beer_servings, dtype: float64

In [21]:
# Let's look at the descriptive stats for wine servings only, for all continents
df.groupby("continent")["wine_servings"].describe()

continent,count,mean,std,min,25%,50%,75%,max
Africa,53.0,16.264151,38.846419,0.0,1.0,2.0,13.0,233.0
Asia,44.0,9.068182,21.667034,0.0,0.0,1.0,8.0,123.0
Europe,45.0,142.222222,97.421738,0.0,59.0,128.0,195.0,370.0
North America,23.0,24.521739,28.266378,1.0,5.0,11.0,34.0,100.0
Oceania,16.0,35.625,64.55579,0.0,1.0,8.5,23.25,212.0
South America,12.0,62.416667,88.620189,1.0,3.0,12.0,98.5,221.0


## apply(), map() and applymap()

- apply() applies a function along an axis of a DataFrame (once per column or once per row) or to each value of a Series.
- applymap() applies a function to every element of a DataFrame. It is deprecated since pandas 2.1 in favor of DataFrame.map().
- map() substitutes each value in a Series with another value, using a dict, a Series, or a function.

![](https://miro.medium.com/max/1796/1*deCRAl5DuNZ1a0TNGKYrNQ.png)
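
A compact sketch of the three methods side by side, on a hypothetical three-row frame. The `hasattr` check is there only because the elementwise method is named `DataFrame.map()` in pandas 2.1 and later, and `applymap()` before that.

```python
import pandas as pd

toy = pd.DataFrame(
    {"beer": [89, 25, 245], "wine": [54, 14, 312]},
    index=["Albania", "Algeria", "Andorra"],
)

# Series.map: substitute each value, here with a function
beer_level = toy["beer"].map(lambda v: "high" if v > 100 else "low")

# DataFrame.apply: the function receives a whole column (axis=0, the default) ...
column_range = toy.apply(lambda col: col.max() - col.min())
# ... or a whole row (axis=1)
row_total = toy.apply(lambda row: row["beer"] + row["wine"], axis=1)

# Elementwise over a whole DataFrame
if hasattr(pd.DataFrame, "map"):      # pandas >= 2.1
    doubled = toy.map(lambda v: v * 2)
else:                                 # older pandas
    doubled = toy.applymap(lambda v: v * 2)

print(beer_level.tolist(), column_range.to_dict(), row_total.tolist())
```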


In [22]:
# map - substitutes each value using the dict; values missing from the dict become NaN
df['num_cont'] = df['continent'].map({'Asia': 0, 'Europe': 1})
df.head()

country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent,num_cont
Afghanistan,0.0,0.0,0.0,0.0,Asia,0.0
Albania,89.0,132.0,54.0,4.9,Europe,1.0
Algeria,25.0,0.0,14.0,0.7,Africa,
Andorra,245.0,138.0,312.0,12.4,Europe,1.0
Angola,217.0,57.0,45.0,5.9,Africa,


In [23]:
# apply - on a DataFrame, calls the function once per column (axis=0) or once per row (axis=1)
# idxmax returns the index label of each column's maximum
df.loc[:, ['wine_servings', 'beer_servings']].apply(lambda col: col.idxmax())


wine_servings     France
beer_servings    Namibia
dtype: object

In [24]:
#Exercise
#Drop all the rows with any zeros and compare the summary stats 
df.replace(0, np.nan, inplace=True)

In [25]:
df.head()

country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent,num_cont
Afghanistan,,,,,Asia,
Albania,89.0,132.0,54.0,4.9,Europe,1.0
Algeria,25.0,,14.0,0.7,Africa,
Andorra,245.0,138.0,312.0,12.4,Europe,1.0
Angola,217.0,57.0,45.0,5.9,Africa,


In [26]:
df.dropna(inplace=True)
df.head()

country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent,num_cont
Albania,89.0,132.0,54.0,4.9,Europe,1.0
Andorra,245.0,138.0,312.0,12.4,Europe,1.0
Armenia,21.0,179.0,11.0,3.8,Europe,1.0
Austria,279.0,75.0,191.0,9.7,Europe,1.0
Azerbaijan,21.0,46.0,5.0,1.3,Europe,1.0


In [27]:
df.shape

(43, 6)

In [28]:
df.groupby("continent")["wine_servings"].describe()

continent,count,mean,std,min,25%,50%,75%,max
Europe,43.0,148.837209,94.524859,5.0,70.0,129.0,203.5,370.0
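
Only Europe is left in the result above. The likely reason is that `num_cont` is NaN for every continent missing from the `map()` dict, and `replace(0, np.nan)` also wiped the code 0 assigned to Asia, so `dropna()` removed all of those rows. One way to avoid this, sketched on hypothetical toy data, is to limit both steps to the serving columns. The `serving_cols` list and the last row are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "beer_servings": [0, 89, 25, 77],
        "wine_servings": [0, 54, 14, 16],
        "continent": ["Asia", "Europe", "Africa", "Asia"],
        "num_cont": [0.0, 1.0, np.nan, 0.0],
    },
    # The last row is made up so that one Asian row has no zeros
    index=["Afghanistan", "Albania", "Algeria", "Hypothetical"],
)

serving_cols = ["beer_servings", "wine_servings"]

# Replace zeros only in the serving columns, leaving the num_cont code 0 intact
toy[serving_cols] = toy[serving_cols].replace(0, np.nan)

# Drop a row only when one of the serving columns is missing
cleaned = toy.dropna(subset=serving_cols)

print(cleaned)
```

With `subset`, rows from Africa and Asia survive as long as their serving values are present, so the per-continent summary stays comparable with the earlier one.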
