# 1. Summary Statistics

We can think of summary statistics as statistics that describe the characteristics of a group of things. They include:
* measures of central tendency, which describe data by identifying the central position within it, e.g. mean, median and mode
* measures of dispersion (spread), which describe the variability in a sample or population, e.g. range, quartiles, variance and standard deviation
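
As a rough illustration of the two groups of measures listed above, here is a minimal sketch on a made-up series of rents. The values and variable names are hypothetical and are not taken from the rental dataset used later.

```python
import pandas as pd

# Hypothetical sample: monthly rents (in thousands) for seven listings
rents = pd.Series([120, 150, 150, 200, 250, 300, 900])

# Measures of central tendency
mean_rent = rents.mean()
median_rent = rents.median()
mode_rent = rents.mode().iloc[0]  # mode() returns a Series because ties are possible

# Measures of dispersion
rent_range = rents.max() - rents.min()
q1, q3 = rents.quantile(0.25), rents.quantile(0.75)
variance = rents.var()  # sample variance (ddof=1 by default)
std_dev = rents.std()   # sample standard deviation

print(mean_rent, median_rent, mode_rent, rent_range, q1, q3, std_dev)
```

Notice that the single very large rent pulls the mean well above the median, which is one reason to look at more than one measure.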

The describe() method in pandas gives you high-level summary statistics for your dataframe, helping you better understand the distribution of the data in its columns. The statistics given include the mean, standard deviation and quantile values. Let's look at a practical example of this.

Additional Reading:
1. Measures of Central Tendency, Laerd Statistics - https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php
2. Measures of Spread, Laerd Statistics - https://statistics.laerd.com/statistical-guides/measures-of-spread-range-quartiles.php
3. Describe documentation - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

In [14]:
import pandas as pd
df = pd.read_csv('rental_listings_nigeria/castles.csv')
df.describe()

Unnamed: 0,unit,baths
count,0.0,0.0
mean,,
std,,
min,,
25%,,
50%,,
75%,,
max,,


Note that there are six columns in our dataframe but the describe() method produces output that only includes two of them (unit and baths). This is because, by default, describe() only returns summary statistics for numerical columns. In this dataset both numerical columns are entirely empty, which is why the count is 0 and the remaining statistics are NaN. The summary statistics returned are the:
1. count - the number of non-null values in each column; rows that have a null or NA value are excluded
2. mean 
3. standard deviation
4. lowest number
5. 25th percentile
6. 50th percentile
7. 75th percentile
8. highest number
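
To see these statistics with actual numbers, here is a small sketch on a hypothetical dataframe (the column values are invented for illustration). It shows that the text column is left out and that count skips the missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical listings with one missing value in 'baths'
toy = pd.DataFrame({
    'bedrooms': [1, 2, 3, 4],
    'baths': [1.0, np.nan, 2.0, 3.0],
    'state': ['Lagos', 'Lagos', 'Lagos', 'Lagos'],
})

# Only the numerical columns appear in the default summary
summary = toy.describe()
print(summary)

# count ignores the NaN in 'baths'
print(summary.loc['count'])
```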

In the event that you want summary statistics for string (object) columns, you can pass the argument include=['object'] to describe(). Let us look at a practical example of this.

In [15]:
df.describe(include=['object'])

Unnamed: 0,price,location,bedrooms,state
count,2729,2599,1288,2729
unique,274,1722,30,1
top,Price on call,"Parkview Estate, Ikoyi",3,Lagos
freq,593,199,595,2729


As you can see, by adding the argument include=['object'] we specifically ask for summary statistics pertaining to string columns, leaving out the numerical columns. Instead of returning the count, mean, standard deviation and quartiles, we get the count, the number of unique values per column, the most frequent value (top) and the number of times that value appears (freq).

If we want to get summary statistics for all data types, we simply amend the include argument to 'all'.

In [16]:
df.describe(include='all')

Unnamed: 0,price,unit,location,bedrooms,baths,state
count,2729,0.0,2599,1288.0,0.0,2729
unique,274,,1722,30.0,,1
top,Price on call,,"Parkview Estate, Ikoyi",3.0,,Lagos
freq,593,,199,595.0,,2729
mean,,,,,,
std,,,,,,
min,,,,,,
25%,,,,,,
50%,,,,,,
75%,,,,,,


This returns a mixture of summary statistics both pertaining to character and numerical columns.

In [17]:
df.head()

Unnamed: 0,price,unit,location,bedrooms,baths,state
0,\r\n ...,,"Millenium Estate, Gbagada Lagos",,,Lagos
1,\r\n ...,,"Parkview Estate, Ikoyi",,,Lagos
2,\r\n ...,,From Bourdillon Road Ikoyi,4.0,,Lagos
3,\r\n ...,,No. 22 Akiogun Street Oniru,,,Lagos
4,Price on call,,"Estaport, Gbagada, Lagos State",3.0,,Lagos


# 2. Primer to data cleaning

When people talk about data science, the hype focuses on machine learning. However, the truth of the matter is that the bulk of your time as a data scientist or analyst may be spent getting data, understanding it and cleaning it to make it more consistent and easier to analyse. It has been said that data scientists spend up to 80% of their time obtaining and cleaning data.

**What is data cleaning**

Data cleaning refers to the process of ensuring that your data is consistent, removing corrupt and clearly incorrect data that may have an adverse effect on your analysis. This may include dropping irrelevant columns. If, for example, you are looking to build a machine learning model to predict the price of a house in Lagos, you may realise that some columns have no intrinsic value: they do not give you any useful information that can help in creating a model that generalizes to new data. This could include columns such as the plot number allocated to the property. This process requires a good understanding of your data and its context. If you understand the real estate market, you will be better equipped to judge which columns of data are irrelevant.

***Duplicates***

The data cleaning process also includes removing duplicate data. However, in some cases duplicate data may simply indicate something occurring more frequently. For example, in our dataset duplications in the location column may simply indicate that rentals in a certain location appear more often. Now let us assume each rental listing had a unique code and we noticed that one code appeared more than once and referred to the exact same property. In such a scenario we would want to remove the duplicate so that we do not double count the listing.

***Data Validity***

In certain cases we expect data to conform to some rules, ranges and constraints. We may expect:
1. a value in a column to be a certain data type. For example, a column asking whether or not a house listing has a garage may only have the boolean values True or False. We can refer to this as a data-type constraint
2. a value in a numerical column to lie within a certain range for example a score out of 10 to be between 1 and 10. We can refer to this as a range constraint
3. each value in a column to be unique. We can refer to this as a unique constraint.

These are a few examples of the constraints we may expect from each column. The constraints will be data and context dependent.
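
The three constraints above could be checked along the following lines. This is only a sketch: the dataframe and its column names (listing_id, has_garage, score) are hypothetical.

```python
import pandas as pd

# Hypothetical listings table used only to illustrate the three checks
listings = pd.DataFrame({
    'listing_id': [101, 102, 102, 104],
    'has_garage': [True, False, True, False],
    'score': [7, 11, 3, 10],
})

# 1. Data-type constraint: the column should hold booleans
garage_is_bool = pd.api.types.is_bool_dtype(listings['has_garage'])

# 2. Range constraint: scores should lie between 1 and 10 (inclusive)
out_of_range = listings[~listings['score'].between(1, 10)]

# 3. Unique constraint: every listing_id should appear only once
ids_are_unique = listings['listing_id'].is_unique

print(garage_is_bool, ids_are_unique)
print(out_of_range)
```

Here the score of 11 breaks the range constraint and the repeated listing_id breaks the unique constraint.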

***Uniformity***

It is important to retain uniformity in how data is recorded and measured. For example, if we measure weight in kilograms we do not want to see another row where the weight is given in grams or pounds, as this would make the data in the column inconsistent.

***Syntax errors***

We may encounter columns with typos, unnecessary whitespace or stray characters that change the data type of values in a given column. For example, a semicolon in a column that is supposed to hold numerical values may cause an integer or float to be read as a string. We would need to remove the semicolon and convert the value back to an integer or float.
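
A minimal sketch of this kind of repair, assuming a hypothetical price column in which one value contains a stray semicolon and another has surrounding whitespace:

```python
import pandas as pd

# Hypothetical price column polluted by a stray semicolon and whitespace,
# which forces pandas to store every value as text
prices = pd.Series(['1200', ' 950 ', '1;500', '800'])

cleaned = (
    prices.str.strip()                        # drop leading/trailing whitespace
          .str.replace(';', '', regex=False)  # remove the stray semicolon
)

# Convert the cleaned strings to numbers
prices_numeric = pd.to_numeric(cleaned)
print(prices_numeric)
```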

***Missing Values***

In certain instances we may encounter columns that have null or NA values, indicating that the relevant data is missing. It is important to note that while NaN is the default marker for missing values in pandas, None or NA may also be used in other cases.
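
As a quick sketch of how these markers behave, the hypothetical series below mixes NaN and None. In a numeric column pandas treats both as missing, and isna() flags them.

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing a NaN and a None as missing-value markers
weights = pd.Series([80.0, np.nan, None, 49.0])

missing_mask = weights.isna()   # True where a value is missing
n_missing = missing_mask.sum()  # number of missing values

print(missing_mask)
print(n_missing)
```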

### 2.1 Dealing with duplicates

Let us open the fullnames_ids file, save it as a dataframe and look at how we can deal with duplicate values practically.

Here is the context: each person has an ID, and that ID should be a unique value associated with just one person. We therefore have to check whether there are any duplicated values in the ID column. We can use the duplicated() function and sum its result to get a count. The duplicated() function returns a boolean for each row: True if the row repeats a value seen in an earlier row, and False otherwise (by default the first occurrence is not marked as a duplicate).

Read more on duplicated function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html
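
To make the default behaviour concrete, here is a small sketch on a hypothetical dataframe (not the fullnames_ids file). It contrasts the default keep='first' with keep=False, which flags every copy including the first.

```python
import pandas as pd

# Hypothetical IDs, with 9112 recorded three times
toy_ids = pd.DataFrame({
    'ID': [9112, 9112, 9111, 9112],
    'Name': ['Halima', 'Halima', 'Nyamwali', 'Halima'],
})

# Default keep='first': the first occurrence is not flagged
later_copies = toy_ids['ID'].duplicated()

# keep=False flags every row whose ID appears more than once
all_copies = toy_ids['ID'].duplicated(keep=False)

print(later_copies.sum(), all_copies.sum())
```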

In [29]:
file = pd.read_csv('fullnames_ids.csv')
file['ID'].duplicated().sum()

6

From running this function we see that there are 6 duplicates. Let us filter the duplicate values in our dataframe

In [30]:
file[file['ID'].duplicated()]

Unnamed: 0,ID,Name,Surname
1,9112,Halima,Oni
2,9112,Halima,Oni
14,9112,Halima,Oni
15,9112,Halima,Oni
16,9111,Nyamwali,Larbi
17,9111,Nyamwali,Larbi


From this we can see that the duplicated rows belong to two IDs: 9112 and 9111. Since we know that these are true duplicates, as opposed to different people having been assigned an incorrect ID, we will drop the duplicated IDs using the drop_duplicates() function with inplace=True.

Read more about the drop_duplicates function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html 



In [31]:
file.drop_duplicates(subset='ID',inplace=True)
file

Unnamed: 0,ID,Name,Surname
0,9112,Halima,Oni
3,9111,Nyamwali,Larbi
4,9113,Jacline,Iddrisu
5,9114,Thadei,Juma
6,9115,Cosmas,Zedi
7,9116,Simbarashe,Moyo
8,9117,Takudzwa,Ndlovu
9,9118,Gabisile,Ndabezitha
10,9119,Duduzile,Mgazi
11,9120,Lindelani,Meqo


### 2.2 Dealing with un-uniform data

Now we are going to deal with a dataframe where we have inconsistent data in both the height and weight columns. In this scenario we are going to assume we want to convert all weights to kilograms and all heights to centimeters.

We will start off by reading the relevant csv as a pandas dataframe.

In [137]:
file_one = pd.read_csv('names_weight_height.csv')
file_one

Unnamed: 0,ID,Name,Surname,Weight,Height
0,9112,Halima,Oni,80kgs,178cm
1,9111,Nyamwali,Larbi,198lbs,180cm
2,9113,Jacline,Iddrisu,95kgs,160cm
3,9114,Thadei,Juma,220lbs,158cm
4,9115,Cosmas,Zedi,300lbs,164cm
5,9116,Simbarashe,Moyo,170lbs,5.3 feet
6,9117,Takudzwa,Ndlovu,50kgs,6.1 feet
7,9118,Gabisile,Ndabezitha,49kgs,5.7 feet
8,9119,Duduzile,Mgazi,145lbs,5.9 feet
9,9120,Lindelani,Meqo,49kgs,171cm


Let us start with the weight column. We want to remove the trailing 'kgs' characters since we ultimately want the values in this column to be floats (in the event that converting pounds to kilograms creates decimal values). We do not want to carry out any other calculation to values that have trailing 'kgs' characters.

If, on the other hand, the value has trailing 'lbs' characters, we want to first remove these characters and then multiply the value by 0.453592 in order to convert it to kilograms.

We could think about looping through each value in our weight column and using an if/else statement that applies a different conversion depending on whether a value has a trailing 'kgs' or 'lbs'. However, we would quickly learn that a crude loop in pandas is generally the slowest way to get anything done and does not take advantage of the inbuilt optimizations and vectorization in both pandas and numpy.

For more reading:
* A beginner's guide to optimizing pandas code for speed: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
* Fast, flexible, easy and intuitive: How to speed up your pandas projects https://realpython.com/fast-flexible-pandas/
* Vectorization and parallelization with numpy and pandas, WZB Data Science Blog https://realpython.com/fast-flexible-pandas/

We are going to apply a lambda function that will strip the trailing kgs if the weight contains kgs and convert the result to a numerical value. Alternatively, if the value contains a trailing lbs, we will strip the lbs characters, convert the result to a numerical value and multiply it by the conversion factor.

In [138]:
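# conversion factor from pounds to kilograms
pounds_to_kgs = 0.453592
# strip the unit suffix, convert to a number and scale the values given in pounds
file_one['Weight'] = file_one['Weight'].apply(
    lambda x: pd.to_numeric(x.strip('kgs')) if 'kgs' in x
    else pd.to_numeric(x.strip('lbs')) * pounds_to_kgs)
file_one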

Unnamed: 0,ID,Name,Surname,Weight,Height
0,9112,Halima,Oni,80.0,178cm
1,9111,Nyamwali,Larbi,89.811216,180cm
2,9113,Jacline,Iddrisu,95.0,160cm
3,9114,Thadei,Juma,99.79024,158cm
4,9115,Cosmas,Zedi,136.0776,164cm
5,9116,Simbarashe,Moyo,77.11064,5.3 feet
6,9117,Takudzwa,Ndlovu,50.0,6.1 feet
7,9118,Gabisile,Ndabezitha,49.0,5.7 feet
8,9119,Duduzile,Mgazi,65.77084,5.9 feet
9,9120,Lindelani,Meqo,49.0,171cm


Ok, now let's walk through what we just did. We started off by using the apply function, which allows us to pass a function to every single value in the weight column. We then created a lambda function that checks if a given value contains the characters 'kgs'. If this evaluates to True, we strip the 'kgs' characters and convert the output to a numerical value, which can be either an integer or a float (the result depends on the data we are using).

If, however, the value does not contain 'kgs', we remove the trailing 'lbs' characters, convert the output to a numerical value and multiply it by pounds_to_kgs. In this instance every value has either a trailing kgs or a trailing lbs, so after handling the values that contain kgs we only need an else branch to capture the rest.

By doing this we have made the values in our weight column consistent.
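
Since apply still calls the lambda once per value, a column-wise alternative is sketched below. It is not the approach used in this notebook, and the sample values and variable names are hypothetical. It assumes every entry is a number immediately followed by either kgs or lbs.

```python
import pandas as pd

POUNDS_TO_KGS = 0.453592

# Hypothetical weights in the same mixed format as the Weight column
weights = pd.Series(['80kgs', '198lbs', '50kgs', '220lbs'])

# Pull out the numeric part and the unit with one regular expression
parts = weights.str.extract(r'^(?P<value>[\d.]+)\s*(?P<unit>kgs|lbs)$')
values = parts['value'].astype(float)

# Keep values already in kilograms; convert the rest from pounds
weights_kg = values.where(parts['unit'] == 'kgs', values * POUNDS_TO_KGS)
print(weights_kg)
```

Any entry that does not match the expected pattern would come out as NaN, which makes unexpected formats easy to spot.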

Apply function documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
Pd.to_numeric function documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html

In other, simpler scenarios you may, for example, need to convert a date to a consistent format. For this you could use the astype or to_datetime function, specifying the format to match your desired output.

Read more on converting column to datetime: https://stackoverflow.com/questions/26763344/convert-pandas-column-to-datetime
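
As a small sketch of this, assume a hypothetical column of dates written as day/month/year strings. Passing format to to_datetime tells pandas how to read them, and strftime writes them back out in one consistent style.

```python
import pandas as pd

# Hypothetical dates written as day/month/year strings
raw_dates = pd.Series(['24/02/2020', '05/11/2019', '12/12/2019'])

# Stating the format avoids guessing whether 05/11 means 5 November or 11 May
parsed = pd.to_datetime(raw_dates, format='%d/%m/%Y')

# Re-express the dates in one consistent text format (year-month-day)
iso_dates = parsed.dt.strftime('%Y-%m-%d')
print(iso_dates)
```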

#### 2.2.1 Dealing with datetime data

Let's look a bit more into datetime data. In many cases you will encounter a dataset with a timestamp, date or datetime column. You may need to convert a timestamp to a full date, or extract the day, month and/or year from a date column.

Here is a practical example for us to look at

In [94]:
datetime_file = pd.read_csv('datetime.csv')
datetime_file['Unix Timestamp'] = pd.to_datetime(datetime_file['Unix Timestamp'], unit='s')
datetime_file['Unix Timestamp']

0   2020-02-24 11:31:16
1   2009-10-12 20:00:29
2   2017-02-11 19:30:30
3   2019-12-12 17:29:50
4   2019-11-05 20:00:00
5   2020-02-24 11:33:38
6   2009-08-17 20:05:34
Name: Unix Timestamp, dtype: datetime64[ns]

We decided to start off by looking at Unix time, also called epoch time. This refers to the number of seconds that have elapsed since the Unix epoch, that is, since 1 January 1970 00:00:00 UTC.

To convert this time to a datetime format we use pd.to_datetime and specify the units as seconds since the unit of measure of our timestamp is seconds.
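
A tiny sketch with hand-picked second counts (hypothetical inputs, not read from datetime.csv) shows how the epoch works: zero seconds maps to the epoch itself and 86400 seconds to one day later.

```python
import pandas as pd

# Seconds since the Unix epoch: the epoch itself, one day later, and a recent moment
seconds = pd.Series([0, 86400, 1582543876])

# unit='s' tells pandas the integers are seconds rather than nanoseconds
as_datetime = pd.to_datetime(seconds, unit='s')
print(as_datetime)
```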

In [96]:
datetime_file['Date'] = pd.to_datetime(datetime_file['Date'])
datetime_file['Date']

0   2009-01-01
1   2016-05-07
2   2007-09-11
3   2009-08-15
4   2014-02-15
5   2018-05-19
6   2019-10-24
Name: Date, dtype: datetime64[ns]

Unlike the Unix Timestamp column, the Date column is not a count of seconds, so we do not have to specify a unit of measure in this case.

Now let's assume we want to create a column that is the time difference between the 'Unix Timestamp' and 'Date' columns, shown in number of days. To do this we subtract the two columns as we would any other numerical columns. This gives us the timedelta between the two columns, which includes the number of days along with the hours, minutes and seconds of difference. We then use the pandas Series.dt.days attribute to extract only the number of days and save the result to a column we will name *'Days Delta'*.

In [104]:
datetime_file['Days Delta'] = (datetime_file['Unix Timestamp'] - datetime_file['Date']).dt.days
datetime_file['Days Delta']

0    4071
1   -2399
2    3441
3    3771
4    2089
5     646
6   -3720
Name: Days Delta, dtype: int64

Now let's assume we need to extract the day of the week, the month and the hour from our Unix Timestamp column and create a separate column for each value we extract.

In [114]:
#extracting week day
datetime_file['Unix Weekday'] = datetime_file['Unix Timestamp'].dt.day_name()
#extracting month
datetime_file['Unix Month'] = datetime_file['Unix Timestamp'].dt.month_name()
#extracting hour
datetime_file['Unix Hour'] = datetime_file['Unix Timestamp'].dt.hour
datetime_file

Unnamed: 0,Unix Timestamp,Date,Days Delta,Unix Weekday,Unix Month,Unix Hour
0,2020-02-24 11:31:16,2009-01-01,4071,Monday,February,11
1,2009-10-12 20:00:29,2016-05-07,-2399,Monday,October,20
2,2017-02-11 19:30:30,2007-09-11,3441,Saturday,February,19
3,2019-12-12 17:29:50,2009-08-15,3771,Thursday,December,17
4,2019-11-05 20:00:00,2014-02-15,2089,Tuesday,November,20
5,2020-02-24 11:33:38,2018-05-19,646,Monday,February,11
6,2009-08-17 20:05:34,2019-10-24,-3720,Monday,August,20
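
Other components can be pulled out in the same way. The sketch below uses two hypothetical timestamps to show the .dt.year and .dt.day accessors, which return plain integers.

```python
import pandas as pd

# Two hypothetical timestamps in the same format as the column above
stamps = pd.Series(pd.to_datetime(['2020-02-24 11:31:16', '2009-10-12 20:00:29']))

years = stamps.dt.year        # calendar year as an integer
day_of_month = stamps.dt.day  # day of the month as an integer
print(years.tolist(), day_of_month.tolist())
```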


To get a deeper understanding on basic time series manipulation here are a few sources to help you out:
1. Basic time series manipulation with Pandas, Towards Data Science https://towardsdatascience.com/basic-time-series-manipulation-with-pandas-4432afee64ea
2. Tutorial: Time series analysis with Pandas, DataQuest, https://towardsdatascience.com/basic-time-series-manipulation-with-pandas-4432afee64ea

### 2.3 Dealing with missing values

One issue that you may encounter regularly with datasets is missing data. This can happen for many reasons: certain observations may not have been recorded, the dataset might have been corrupted, or the data could be missing at random.

When data is missing at random it can be either: i.) missing completely at random, meaning the missingness of the data has no relationship with any of the values, observed or otherwise; or ii.) missing at random, meaning there may be a systematic relationship between missingness and other observed data, but not the missing data itself. A few methods are available to try and ascertain the reason the data is missing. However, we will not be dealing with this.

If you would like you can go through this source to learn more about determining whether data is missing at random: https://s3.amazonaws.com/assets.datacamp.com/production/course_17404/slides/chapter2.pdf

We typically either opt to ignore the rows with missing data, delete the rows completely or impute some value in our missing data. The most interesting of these options is imputing missing values in our data. Let's look into this a bit more
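
Before turning to imputation, here is a brief sketch of the first two options on a hypothetical dataframe: leaving the missing value in place (most pandas aggregations skip it by default) and deleting the affected rows with dropna.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing weight
people = pd.DataFrame({
    'Name': ['Halima', 'Jude', 'Thadei'],
    'Weight': [80.0, np.nan, 100.0],
})

# Ignoring: mean() skips missing values by default (skipna=True)
mean_ignoring_missing = people['Weight'].mean()

# Deleting: drop every row whose Weight is missing
complete_rows = people.dropna(subset=['Weight'])

print(mean_ignoring_missing)
print(complete_rows)
```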

#### 2.3.1 Imputation of missing values

##### 2.3.1.1 Imputing missing values with the mean, median or mode

One of the most commonly used methods of dealing with missing values is imputing them with the mean. Imputation in this context simply refers to replacing the missing values with another value (the value we are imputing). This is referred to as a naive imputation method, as it only uses the mean value of the entire column to make assumptions about our missing data. There are a few ways to do this; the most common and simple is to use the fillna() function.

Imputation with the mean or median applies specifically to numeric columns, as you cannot find the mean or median of categorical data.

Moving back to our previous example, let us see how we can impute missing values using the mean weight. Since there are no missing values, we will create three new rows with no weights assigned to them.

In [139]:
#list of dictionaries to create dataframe with new names
new_names = [{'Name':'Jude','Surname':'Sithole'},
             {'Name':'Susan','Surname':'Marie'},
            {'Name':'Lebo','Surname':'Naengop'}]
new_df = pd.DataFrame(new_names)
#concatenate old with new dataframe
file_one = pd.concat([file_one,new_df],axis=0)
file_one



Unnamed: 0,Height,ID,Name,Surname,Weight
0,178cm,9112.0,Halima,Oni,80.0
1,180cm,9111.0,Nyamwali,Larbi,89.811216
2,160cm,9113.0,Jacline,Iddrisu,95.0
3,158cm,9114.0,Thadei,Juma,99.79024
4,164cm,9115.0,Cosmas,Zedi,136.0776
5,5.3 feet,9116.0,Simbarashe,Moyo,77.11064
6,6.1 feet,9117.0,Takudzwa,Ndlovu,50.0
7,5.7 feet,9118.0,Gabisile,Ndabezitha,49.0
8,5.9 feet,9119.0,Duduzile,Mgazi,65.77084
9,171cm,9120.0,Lindelani,Meqo,49.0


In [140]:
file_one['Weight'] = file_one.Weight.fillna(file_one.Weight.mean())
file_one['Weight']

0      80.000000
1      89.811216
2      95.000000
3      99.790240
4     136.077600
5      77.110640
6      50.000000
7      49.000000
8      65.770840
9      49.000000
10     60.000000
11     62.000000
12     51.000000
0      74.196964
1      74.196964
2      74.196964
Name: Weight, dtype: float64

Let's walk through what we just did. The first thing you might notice is that we did not use square brackets to reference the column; instead we wrote the name of the column after the name of the dataframe. This is because there are two ways to select a column in a dataframe: bracket notation, which we have been using the entire time, and dot notation, which we just used. Both give the same result here, and dot notation is quicker to type. Note, however, that dot notation only works when the column name is a valid Python identifier that does not clash with an existing dataframe attribute or method, so a column such as 'Unix Timestamp' must be selected with brackets.

You will also notice that we passed file_one.Weight.mean() as the argument inside the fillna brackets. This computes the mean of the selected column and replaces all missing values with it. We could also opt to replace the missing values with the median or mode if we so choose.

With this method we make the assumption that if some data is missing, the value of the missing data will be close to the median, mean or mode (depending on what we chose to select).
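
Swapping in the median or mode only changes the value passed to fillna. A minimal sketch on a hypothetical series of weights:

```python
import numpy as np
import pandas as pd

# Hypothetical weights with one missing value
weights = pd.Series([80.0, 49.0, 49.0, 136.0, np.nan])

# Median imputation: less sensitive to extreme values than the mean
filled_with_median = weights.fillna(weights.median())

# Mode imputation: mode() returns a Series, so take its first entry
filled_with_mode = weights.fillna(weights.mode().iloc[0])

print(filled_with_median.iloc[-1], filled_with_mode.iloc[-1])
```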

##### 2.3.1.2 What about categorical data

Now let's assume we have a non-numerical column and would like to impute its missing values. With categorical data we could opt to fill each missing value with the preceding value; this is referred to as forward filling. Alternatively, we could fill the missing value with the next value that appears; this is referred to as backward filling.

We could also fill the height column with the height that appears most frequently, the mode.

## Challenge

By now you have learnt quite a bit about pandas and data cleaning. Look at the file_one dataframe and convert all the values in the Height column to centimeters.

Additional Reading

* Tutorial: Basic statistics in Python - Descriptive Statistics, DataQuest: dataquest.io/blog/basic-statistics-with-python-descriptive-statistics/

* The Ultimate Guide to Data Cleaning : https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4

* Impute Missing Values: https://jamesrledoux.com/code/imputation

* Impute categorical missing values in scikit-learn: https://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn