# Exploratory Data Analysis

## Steps involved in EDA

### 1. Data Sourcing

##### Data sources can be

- Private data: restricted to a company or organisation (usually available as CSV or Excel files).

- Public data: free data available to everyone. It can be found on open government websites or Kaggle, or collected through web scraping.

  - Web scraping: fetching data from websites through code (refer to web_scrapping_imdb.ipynb for more details).
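As a minimal sketch of loading tabular data once it has been sourced, the snippet below reads CSV text with pandas. The column names and values are made up for illustration, and an in-memory string stands in for a real file so that the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk.
csv_text = "customerid,age,salary\n1,58,100000\n2,44,60000\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.head())
```

With a real file, the same call would take the file path instead of the `StringIO` object.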

### 2. Data Cleaning

1. Need to clean data

   a. Prepare data for analysis

   b. Possible irregularities to handle:

      1. Identifying the data types
      2. Fixing the rows and columns
      3. Imputing/removing missing values
      4. Handling outliers
      5. Standardising the values
      6. Fixing invalid values
      7. Filtering the data


### Segment- 2 : Data Types 

A data set contains variables of several data types: some are numerical and some are categorical. After reading the data into a data frame, you should check the data type of each variable.

The following are some of the types of variables:
- **Numeric data type**: banking dataset: salary, balance, duration and age.
- **Categorical data type**: banking dataset: education, job, marital, poutcome and month etc.
- **Ordinal data type**: banking dataset: Age group.
- **Time and date type** 
- **Coordinates type of data**: latitude and longitude type.
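The variable types above can be made explicit in pandas. This is a sketch under assumptions: the rows are invented, and only the column names are borrowed from the banking examples in the list.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the banking examples above.
df = pd.DataFrame({
    "age": [25, 41, 63],
    "salary": [20000.0, 55000.0, 30000.0],
    "job": ["admin", "technician", "retired"],
    "age_group": ["<30", "40-50", "60+"],
    "contact_date": ["2017-05-03", "2017-06-11", "2017-07-29"],
})

# Categorical (unordered) variable.
df["job"] = df["job"].astype("category")

# Ordinal variable: the order of the categories is stated explicitly.
age_order = ["<30", "30-40", "40-50", "50-60", "60+"]
df["age_group"] = pd.Categorical(df["age_group"], categories=age_order, ordered=True)

# Date strings parsed into a datetime column.
df["contact_date"] = pd.to_datetime(df["contact_date"])

print(df.dtypes)
```

Declaring the order lets comparisons and sorting on `age_group` follow the business meaning rather than alphabetical order.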

### Segment- 3, Fixing the Rows and Columns 

Checklist for fixing rows:
- **Delete summary rows**: Total and Subtotal rows
- **Delete incorrect rows**: Header row and footer row
- **Delete extra rows**: Column number, indicators, Blank rows, Page No.

Checklist for fixing columns:
- **Merge columns for creating unique identifiers**, if needed, for example, merge the columns State and City into the column Full address.
- **Split columns to get more data**: Split the Address column to get State and City columns to analyse each separately. 
- **Add column names**: Add column names if missing.
- **Rename columns consistently**: Abbreviations, encoded columns.
- **Delete columns**: Delete unnecessary columns.
- **Align misaligned columns**: The data set may have shifted columns, which you need to align correctly.
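A few of the checklist items above can be sketched in pandas as follows. The raw table, its column names and the address format are assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw export: a blank row, a "Total" summary row and a combined address column.
raw = pd.DataFrame({
    "cust": ["C1", "C2", None, "Total"],
    "Address": ["Pune, Maharashtra", "Mysuru, Karnataka", None, None],
    "amt": [100.0, 250.0, None, 350.0],
})

# Delete blank rows and summary rows.
df = raw.dropna(how="all")
df = df[df["cust"] != "Total"].copy()

# Rename abbreviated columns consistently.
df = df.rename(columns={"cust": "customer_id", "amt": "amount"})

# Split one column into two so that each part can be analysed separately.
df[["City", "State"]] = df["Address"].str.split(", ", expand=True)

# Merge columns into a single combined column.
df["Full address"] = df["City"] + ", " + df["State"]

# Delete a column that is no longer needed.
df = df.drop(columns=["Address"])

print(df)
```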


### Segment- 4, Impute/Remove missing values 

Takeaways from the lecture on missing values:

- **Set values as missing values**: Identify values that indicate missing data, for example, treat blank strings, "NA", "XX", "999", etc., as missing.
- **Adding is good, exaggerating is bad**: You should try to get information from reliable external sources as much as possible, but if you can’t, then it is better to retain missing values rather than exaggerating the existing rows/columns.
- **Delete rows and columns**: Rows can be deleted if the number of missing values is insignificant, as this would not impact the overall analysis results. Columns can be removed if the missing values are quite significant in number.
- **Fill partial missing values using business judgement**: Such values include missing time zone, century, etc. These values can be identified easily.
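The first and third takeaways can be sketched in pandas as below. The data, the placeholder values and the 50% threshold for dropping a column are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which "", "NA" and 999 stand in for missing values.
df = pd.DataFrame({
    "age": [34, 999, 51, 28, 45],
    "job": ["admin", "NA", "services", "", "admin"],
    "notes": [np.nan, np.nan, np.nan, "call back", np.nan],
})

# Set placeholder values as missing values.
df["age"] = df["age"].replace(999, np.nan)
df["job"] = df["job"].replace(["", "NA"], np.nan)

# The share of missing values per column guides the delete-or-keep decision.
missing_share = df.isna().mean()
print(missing_share)

# Remove columns that are mostly missing (the 0.5 threshold is an arbitrary choice).
df = df.loc[:, missing_share <= 0.5]

# Remove the few rows where a key column is missing.
df = df.dropna(subset=["age"])
```

The remaining missing `job` value is deliberately retained rather than filled with a guess.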

Types of missing values:
- **MCAR**: It stands for Missing completely at random (the reason behind the missing value is not dependent on any other feature).
- **MAR**: It stands for Missing at random (the reason behind the missing value may be associated with some other features).
- **MNAR**: It stands for Missing not at random (there is a specific reason behind the missing value).


### Segment- 5, Handling Outliers 

**Outliers** - values that lie beyond the normal range of values.

Types of outliers

- **Univariate outliers** - single variable
  -  Univariate outliers are those data points in a variable whose values lie beyond the range of expected values.
  
- **Multivariate outliers** - multiple variables
  - While plotting data, some values of one variable may not lie beyond the expected range, but when you plot the data with some other variable, these values may lie far from the expected value. These are called multivariate outliers.


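**1. Causes of outliers**
 - Investigate the cause of an outlier before deciding how to treat it.
 
**2. Treatment of outliers** 

Major approaches to treating outliers:

- **Imputation**: replace the outlier with a more representative value, such as the median
- **Deletion of outliers**: exclude the value from the particular analysis
- **Binning of values**: group the values into ranges so that an extreme value simply falls into the top or bottom bin
- **Cap the outlier**: cap a very high value at a lower threshold value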
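Deletion and capping of univariate outliers can be sketched as below. The 1.5 × IQR fences are a common convention assumed here, not a rule stated in these notes, and the balances are invented.

```python
import pandas as pd

# Hypothetical balances with one extreme value.
balance = pd.Series([120, 150, 160, 170, 180, 200, 210, 5000])

# Fences based on the interquartile range (1.5 * IQR is an assumed convention).
q1, q3 = balance.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (balance < lower) | (balance > upper)

# Deletion: keep only the values inside the fences.
trimmed = balance[~is_outlier]

# Capping: pull values outside the fences back to the nearest fence.
capped = balance.clip(lower=lower, upper=upper)

print(balance[is_outlier])
print(capped)
```

Whether to delete or cap depends on the cause of the outlier, which is why the cause is investigated first.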

### Segment- 6, Standardising values 

Checklist for data standardization exercises:
- **Standardise units**: Ensure all observations under one variable are expressed in a common and consistent unit, e.g., convert lbs to kg, miles/hr to km/hr, etc.
- **Scale values if required**: Make sure all the observations under one variable have a common scale.
- **Standardise precision** for better presentation of data, e.g., change 4.5312341 kg to 4.53 kg.
- **Remove extra characters** such as common prefixes/suffixes, leading/trailing/multiple spaces, etc. These are irrelevant to analysis.
- **Standardise case**: String variables may take various casing styles, e.g., UPPERCASE, lowercase, Title Case, Sentence case, etc.
- **Standardise format**: It is important to standardise the format of other elements such as dates and names, e.g., change 23/10/16 to 2016/10/23, “Modi, Narendra” to “Narendra Modi”, etc.
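The checklist above could look like this in pandas. The sample rows are made up, and the "Last, First" name pattern and day/month/two-digit-year date format are assumptions about how the raw values are written.

```python
import pandas as pd

# Hypothetical raw values with mixed case, extra spaces, pounds and short dates.
df = pd.DataFrame({
    "name": ["  modi, narendra ", "DOE,  JANE"],
    "weight_lbs": [150.0, 10.0],
    "date": ["23/10/16", "05/01/17"],
})

# Standardise units (lbs to kg) and precision (two decimal places).
df["weight_kg"] = (df["weight_lbs"] * 0.45359237).round(2)

# Remove leading/trailing/multiple spaces and standardise case.
name = df["name"].str.strip().str.replace(r"\s+", " ", regex=True).str.title()

# Standardise format: "Last, First" becomes "First Last".
df["name"] = name.str.replace(r"^(.*), (.*)$", r"\2 \1", regex=True)

# Standardise date format: dd/mm/yy becomes yyyy/mm/dd.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%y").dt.strftime("%Y/%m/%d")

print(df)
```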

### Segment-7, Fixing Invalid Values and Filtering Data

- **Encode unicode properly:** In case the data is being read as junk characters, try to change the encoding. For example, use CP1252 instead of UTF-8.
    
- **Convert incorrect data types:** Change the incorrect data types to the correct data types for ease of analysis. For example, if numeric values are stored as strings, then it would not be possible to calculate metrics such as mean, median, etc. Some of the common data type corrections include changing a string to a number (“12,300” to “12300”), a string to a date (“2013-Aug” to “2013/08”), and a number to a string (“PIN Code 110001” to “110001”).

- **Correct the values that lie beyond the range:** If some values lie beyond the logical range, for example, a temperature below -273.15 °C (0 K), then you would need to correct those values as required. A closer look at the data set would help you determine whether there is scope for correction or the value needs to be removed.

- **Correct the values not belonging to the list:** Remove the values that do not belong to a list. For example, in a data set of blood groups of individuals, strings ‘E’ or ‘F’ are invalid values and can be removed.

- **Fix incorrect structure:** Values that do not follow a defined structure can be removed from a data set. For example, in a data set containing the pin codes of Indian cities, a pin code of 12 digits would be an invalid value and must be removed. Similarly, a phone number of 12 digits would be an invalid value.

- **Validate internal rules:** Internal rules, if present, should be correct and consistent. For example, the date of a product’s delivery should definitely come after the date of purchase.
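Several of the checks above can be expressed as boolean masks in pandas. The table below is invented, and the six-digit PIN code pattern and the list of blood groups are assumptions written out for the example.

```python
import pandas as pd

# Hypothetical records; the second row breaks three of the rules.
df = pd.DataFrame({
    "amount": ["12,300", "4,500", "980"],
    "blood_group": ["A+", "E", "O-"],
    "pin_code": ["110001", "560001234567", "400001"],
    "purchased": ["2021-03-01", "2021-03-05", "2021-03-09"],
    "delivered": ["2021-03-04", "2021-03-02", "2021-03-12"],
})

# Convert incorrect data types: "12,300" (string) to 12300 (number).
df["amount"] = pd.to_numeric(df["amount"].str.replace(",", "", regex=False))

# Values not belonging to the list of valid blood groups.
valid_groups = ["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]
bad_group = ~df["blood_group"].isin(valid_groups)

# Incorrect structure: a PIN code is assumed to be exactly six digits.
bad_pin = ~df["pin_code"].str.fullmatch(r"\d{6}")

# Internal rule: delivery must not come before purchase.
dates = df[["purchased", "delivered"]].apply(pd.to_datetime)
bad_dates = dates["delivered"] < dates["purchased"]

clean = df[~(bad_group | bad_pin | bad_dates)]
print(clean)
```

Keeping the masks separate makes it easy to inspect which rule each rejected row violates before removing it.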

### Filtering Data

- **Delete duplicate data:** Remove identical rows and the rows in which some columns are identical

- **Filter rows:** Filter rows by segment and date period to obtain only rows relevant to the analysis

- **Filter columns:** Filter columns relevant to the analysis

- **Aggregate data:** Group by the required keys and aggregate the rest
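The four filtering steps can be chained in pandas, as in this sketch. The transactions, the segment name and the date window are illustrative assumptions.

```python
import pandas as pd

# Hypothetical transactions; the first two rows are exact duplicates.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "segment": ["retail", "retail", "retail", "corporate", "retail"],
    "date": pd.to_datetime(
        ["2021-01-05", "2021-01-05", "2021-02-10", "2021-02-11", "2021-06-01"]
    ),
    "amount": [100, 100, 250, 900, 50],
    "comment": ["", "", "vip", "", ""],
})

# Delete duplicate data (pass subset=[...] to compare only some columns).
df = df.drop_duplicates()

# Filter rows by segment and date period.
in_period = df["date"].between("2021-01-01", "2021-03-31")
rows = df[(df["segment"] == "retail") & in_period]

# Filter columns relevant to the analysis.
cols = rows[["customer_id", "date", "amount"]]

# Aggregate: group by month and sum the amounts.
monthly = cols.groupby(cols["date"].dt.month)["amount"].sum()
print(monthly)
```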

### Summary

- **Fixing the rows and columns:** You need to remove the irrelevant columns and heading lines from the data set. The irrelevant columns or rows are those that are of absolutely no use for analysis on the data set. Like in the Bank Marketing dataset, the headers and customer ID columns are of absolutely no use.

- **Remove/impute the missing values:** There are different types of missing values in the data set. Based on their type and origin, you need to decide whether they can be removed (if their percentage is too low), or whether they can be considered as a separate category. There is an important possibility where you need to impute missing values with some other value. While doing imputation, one should be very careful because it should not add any wrong information into the data set. The imputation can be done using mean, median, mode or using quantile analysis.

- **Outlier handling:** Outliers are points that lie beyond the normal trend. There are two types of outliers: 
   - Univariate
   - Multivariate

  Outliers should not always be treated as anomalies in the data set.


- **Standardising values:** Sometimes, there are many entries in the data set that are not in the correct format. As you have seen in the Bank Marketing dataset, the duration of the call is in seconds and minutes. It has to be in the same format. The other standardisation involves unit and precision standardisation.

- **Fixing invalid values:** Sometimes, there are some values in the data set that are invalid, maybe in the form of their unit, range, data type, format, etc. It is essential to deal with these types of irregularities before processing the data set.

- **Filter data:** Sometimes, filtering out certain details can help you get a clearer picture of the data set.

## Session- 3, Univariate Analysis 

**Definition:** Analysing/visualising a single variable
    
- Infer knowledge and handle missing values/outliers in a single feature    

- Univariate analysis is broadly of the following three types:

  - Categorical Unordered Univariate Analysis
  - Categorical Ordered Univariate Analysis
  - Statistics on Numerical Features

### Segment- 2, Categorical unordered univariate analysis 

Unordered data do not have the notion of high-low, more-less etc. 
 - Unordered variables are also called nominal variables.

Examples:
- Type of loan taken by a person = home, personal, auto, etc.
- Organisation of a person = sales, marketing, HR, etc.
- Job category of a person.
- Marital status of a person.
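For an unordered categorical variable, the usual first step is to count how often each category occurs. A minimal sketch with invented marital-status values:

```python
import pandas as pd

# Hypothetical marital-status values.
marital = pd.Series(["married", "single", "married", "divorced", "married", "single"])

# Absolute counts per category, most frequent first.
counts = marital.value_counts()

# The same counts expressed as a share of all rows.
shares = marital.value_counts(normalize=True)

print(counts)
print(shares)
```

Because the categories have no order, sorting by frequency is usually the most readable arrangement.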

### Segment- 3, Categorical ordered univariate analysis 

Ordered variables have some kind of ordering. Some examples from the bank marketing dataset are:
- Age group = <30, 30-40, 40-50 and so on.
- Month = Jan, Feb, Mar, etc.
- Education = primary, secondary and so on.

### Segment-4, Statistics on Numerical Features

**Binning:**
One can treat numeric variables as ordered categorical variables. You can deliberately convert the numeric variables into ordered categoricals for analysis. If, for example, you have the income of a few thousand people, which ranges from $5,000 to $100,000, you can categorise them into bins such as [5000, 10000), [10000, 15000), [15000, 20000), etc., so that each value falls into exactly one bin.

This is called ‘binning’.
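A minimal sketch of binning with `pandas.cut`, using invented incomes. The bins are half-open (left edge included, right edge excluded), an assumption made so that a boundary value such as 10000 lands in exactly one bin.

```python
import pandas as pd

# Hypothetical incomes.
income = pd.Series([5200, 9800, 10000, 14999, 18500])

# Bin edges; right=False makes each bin include its left edge only.
edges = [5000, 10000, 15000, 20000]
income_bin = pd.cut(income, bins=edges, right=False)

# The result is an ordered categorical, so counts can follow bin order.
print(income_bin.value_counts().sort_index())
```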

- A box plot is a graphical representation of numerical data points through their median and quartiles, and it is used to understand the spread of the data.

- The mean gives the average of all the values, whereas the median gives a typical value that can be used to represent the entire group.

- Both the standard deviation and the interquartile difference (also called the interquartile range) are used to represent the spread of the data.

- The interquartile difference is a much better metric than the standard deviation if there are outliers in the data, because the standard deviation is strongly influenced by outliers, while the interquartile difference is largely unaffected by them.
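The effect of an outlier on these metrics can be checked numerically. The sketch below uses a small invented sample and adds one extreme value to it; `spread` is a helper name introduced only for this example.

```python
import pandas as pd

# Hypothetical sample, then the same sample with one extreme value added.
values = pd.Series([10, 12, 13, 15, 16, 18, 20])
with_outlier = pd.concat([values, pd.Series([500])], ignore_index=True)


def spread(s):
    """Return centre and spread metrics for a numeric Series."""
    q1, q3 = s.quantile([0.25, 0.75])
    return {"mean": s.mean(), "median": s.median(), "std": s.std(), "iqr": q3 - q1}


before = spread(values)
after = spread(with_outlier)

print(before)
print(after)
```

The mean and standard deviation change drastically once the extreme value is added, while the median and interquartile difference move only slightly.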

### Summary 

- **Categorical unordered univariate analysis:** Unordered variables are variables that do not contain any notion of ordering, such as increasing or decreasing order. These are just various types of any category. Examples can be job types, marital status, blood groups, etc.

- **Categorical ordered univariate analysis:** Ordered variables are variables that follow some kind of ordering, like high-low, fail-success, yes-no. Examples can be education level, salary group like high or low, gradings in any exam, etc.

- **Numerical variable univariate analysis:** Numerical variables can be classified into continuous and discrete. To analyse numerical variables, you need to know statistical measures such as mean, median, mode and quantiles, and tools such as box plots. It is important to understand that numerical variable univariate analysis is nothing but what we have done earlier, i.e., the treatment of missing values and handling outliers. The crux of univariate analysis lies in the single variable analysis, which was covered in the process of cleaning the data set.

- **The transition of a numerical variable into a categorical variable:** This is an important aspect that you need to think about before performing univariate analysis. Sometimes, it is essential to convert numerical variables into categorical ones through a process called binning.

## Session- 4, Bivariate and Multivariate Analysis

This session has been divided into the following topics based on the different types of variables:

- Analysis between two numeric variables
- Correlation vs causation
- Analysis between numeric and categorical variables
- Analysis between two categorical variables
- Multivariate analysis

### Segment-2, Numeric- numeric analysis 

There are three ways to analyse two numeric variables together.
- **Scatter plot**: shows how one variable varies with another variable.
- **Correlation matrix**: describes the strength of the linear relationship between pairs of numeric variables.
   - The correlation matrix quantifies only the linear dependence between the variables; it does not capture the non-linear relations between them.
   - The correlation coefficient is sensitive to outliers.
   - The higher the value (absolute value) of the coefficient of correlation between numerical variables, the higher the linear relation between them.

- **Pair plot**: group of scatter plots of all numeric variables in the data frame.
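The point that a correlation matrix captures only linear dependence can be checked with a small constructed example. The columns below are synthetic: one is an exact linear function of `x` and the other an exact quadratic function of `x`.

```python
import numpy as np
import pandas as pd

# Synthetic data: x is symmetric around zero.
x = np.arange(-10, 11)
df = pd.DataFrame({
    "x": x,
    "linear": 3 * x + 2,   # perfectly linear in x
    "quadratic": x ** 2,   # fully determined by x, but not linearly
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()
print(corr.round(2))
```

Here the coefficient between `x` and `linear` is 1, while the coefficient between `x` and `quadratic` is 0 even though `quadratic` depends entirely on `x`.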

### Segment-3, Correlation vs Causation

- Correlation does not imply causation.
- **Correlation vs causation:** This is a very important concept in data analysis: correlation does not always indicate causation. Although there may be a very high correlation between two variables, there may be no causation at all.

- Examples
  1. The number of people who drowned by falling into a pool is not related to the number of movies starring Nicolas Cage. However, when the two are plotted together, they show a very high correlation, as both plots follow almost the same path.
  2. Similarly, it is obvious that the per capita cheese consumption has no relation to the number of people dying from being tangled in their bedsheets. However, plotting the two shows a high correlation between them.
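One way such spurious correlations arise is that two unrelated quantities both happen to trend over the same period. The sketch below illustrates this with synthetic series (not the actual drowning, movie or cheese figures): each series is generated independently, and neither is computed from the other.

```python
import numpy as np

# Synthetic, independently generated yearly series that both trend upwards.
rng = np.random.default_rng(0)
years = np.arange(2000, 2010)
series_a = 100 + 5 * (years - 2000) + rng.normal(0, 1, size=years.size)
series_b = 30 + 2 * (years - 2000) + rng.normal(0, 1, size=years.size)

# Pearson correlation coefficient between the two series.
r = np.corrcoef(series_a, series_b)[0, 1]
print(round(r, 3))
```

The coefficient is high only because both series share a time trend, which says nothing about one causing the other.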


### Segment- 4, Numerical categorical variable

**Key takeaways**
- Use group by to compare the mean and median of a numerical column across the categories of a categorical column.
- Use a box plot or bar plot to represent those means/medians.

- **Analysis between numerical and categorical variables:** This gives an idea about the variation of a particular numerical variable with respect to different categories of a categorical variable. Boxplot is the best way to look at a numerical variable with respect to a categorical variable. However, boxplots may sometimes not be useful because of the huge difference between the maximum and minimum values in the data set, or due to the higher concentration of data in the numerical variable. Another approach could be to look into the mean/median or quartiles, which are a more efficient way to deal with a numerical variable when combined with a categorical variable.
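The group-by approach from the takeaways can be sketched as follows. The `response` and `salary` columns and their values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical responses and salaries.
df = pd.DataFrame({
    "response": ["yes", "no", "no", "yes", "no", "no"],
    "salary": [60000, 20000, 55000, 100000, 16000, 50000],
})

# Mean and median of the numerical column for each category.
summary = df.groupby("response")["salary"].agg(["mean", "median"])

# Quartiles per category, one column per quantile.
quartiles = df.groupby("response")["salary"].quantile([0.25, 0.5, 0.75]).unstack()

print(summary)
print(quartiles)
```

These tables hold the same numbers that a box plot per category would display.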

### Segment- 5, Categorical categorical variable 

- **Analysis between two categorical variables:** A bar graph is the best approach to analysing two categorical variables.

- One of the interesting examples, also covered in the bank marketing dataset, is that the bank has mostly contacted people in the age group of 30–50, although people in the age group of 60+ gave more positive responses among all the age groups. This is a very important inference that the bank can draw, i.e., it should contact more individuals in the age group of 60+.    
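A cross-tabulation gives the numbers behind such a bar graph. The rows below are invented to mirror the shape of the observation above (more contacts in one group, a higher response rate in another); they are not figures from the bank marketing dataset.

```python
import pandas as pd

# Hypothetical contacts: six in the 30-50 group, two in the 60+ group.
df = pd.DataFrame({
    "age_group": ["30-50"] * 6 + ["60+"] * 2,
    "response": ["no", "no", "no", "no", "no", "yes", "yes", "no"],
})

# Number of contacts per age group and response.
counts = pd.crosstab(df["age_group"], df["response"])

# Response rates within each age group (each row sums to 1).
rates = pd.crosstab(df["age_group"], df["response"], normalize="index")

print(counts)
print(rates)
```

Comparing `counts` with `rates` separates how many people were contacted from how well each group responded.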

### Segment-6, Multivariate Analysis

-  Multivariate analysis yields very specific information about a data set. It basically involves the analysis of more than two variables at a time. For instance, heat maps are the best way to look at three variables at a time. In multivariate analysis, it is essential to look into the data by grouping the variables and inferring decisions from them.
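The grid that a heat map colours can be built with a pivot table: two categorical variables define the rows and columns, and a third variable fills the cells. The column names and values below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: responded is 1 for a positive response and 0 otherwise.
df = pd.DataFrame({
    "education": ["primary", "primary", "tertiary", "tertiary", "primary", "tertiary"],
    "marital": ["married", "single", "married", "single", "married", "single"],
    "responded": [0, 1, 0, 1, 1, 1],
})

# Mean response rate for every education/marital combination.
rate = df.pivot_table(
    index="education", columns="marital", values="responded", aggfunc="mean"
)

print(rate)
```

A plotting library can then render `rate` as a heat map, with one coloured cell per combination.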

## MODULE SUMMARY

- Exploratory Data Analysis (EDA) helps a data analyst look beyond the data. It is a never-ending process – the more you explore the data, the more insights you draw from it. As a data analyst, almost 80% of your time will be spent understanding data and solving various business problems through EDA. If you understand EDA properly, half the battle is won.

 

- Now, one thing that you should keep in mind is that EDA is far more than just a plain visualisation. It is an end-to-end process to analyse a data set and prepare it for model building.


- In this module, you have learnt about some crucial steps in any kind of data analysis. These steps include the following:

**Gather data for analysis:** In the data sourcing part, you learnt about the various sources of data. There are two major types of data sources, namely public data and private data. Private data is associated with some security and privacy concerns, whereas public data is freely available to use without any restrictions on access or usage. There are many websites that provide access to public datasets. You have also learnt about the basics of web scraping, a process to fetch data directly from a web page.

**Preparation and cleaning of data:** In the cleaning process, the main objective is to remove irregularities from a data set. There are many ways to clean data, but the two most important approaches that you learnt are the treatment of missing values and outlier handling.
 
- Now, there are many ways to deal with missing values, for example, removing an entire column or rows with missing values; however, you need to keep in mind that it should not hamper the data with loss of information. The other method to deal with missing values is to just impute them with other values such as mean, median, mode or quantiles. The third method is to treat the missing values as a separate category; this is the safest method to deal with missing values.


- Next, you learnt about the different methods for analysing variables. These methods include the following:

**Univariate analysis:** Univariate analysis involves the analysis of a single variable at a time. There are multiple types of variables, such as categorical ordered and unordered variables and numerical variables. A univariate analysis gives insights about a single variable: how it varies and how many observations fall into each of its categories.

**Bivariate and multivariate analysis:** Bivariate/multivariate analysis involves analysing two or more variables at the same time. These analyses yield very specific insights about a data set. You can infer various findings after bivariate analysis.