
# Importing Libraries

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Reading dataset

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/softhints/Pandas-Exercises-Projects/main/data/food_recipes.csv', low_memory=False)

## Explore Data

* `head(n)` - returns first n rows
* `tail(n)` - returns last n rows
* `sample(n)` - sample random n rows
* `df` - displays the first and last 5 rows, plus the number of rows and columns

In [None]:
df.head(3)

In [None]:
df.sample(4)

In [None]:
df

## Initial findings

* dataset has a header row
* separator is ';'
* `NaN` represents missing values
* 2 columns with nested data - `ingredients` and `tags`

## Data Findings

* dataset contains 8009 rows and 16 columns
* 2 numeric columns
* 2 columns need conversion to numeric - `prep_time` and `cook_time`
    * convert string to int
    * `15 M` -> `15`

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.dtypes

## Statistics on numeric columns

* mean value is close to the 75th percentile
* no missing values
* big difference between the max and the 75th percentile for `vote_count` - potential outliers

In [None]:
df.describe()

## Observations on object columns

* 4 categorical columns - `category`, `cuisine`, `diet` and `course`
* 1 column with a single value - `record_health`

In [None]:
df.describe(include='object')
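A quick way to confirm the single-value observation is to count distinct values per column. The sketch below uses a small made-up frame (the `toy` name and its values are assumptions, not rows from the dataset) and relies only on `DataFrame.nunique()`.

```python
import pandas as pd

# Toy frame with invented values; only the column names echo the dataset.
toy = pd.DataFrame({
    'diet': ['Vegan', 'Vegetarian', 'Vegan', None],
    'record_health': ['good', 'good', 'good', 'good'],
    'name': ['a', 'b', 'c', 'd'],
})

# Distinct non-missing values per column.
n_unique = toy.nunique()

# Columns holding exactly one distinct value carry no information.
constant = n_unique[n_unique == 1].index.tolist()
print(n_unique)
print(constant)
```

On the real data the same two lines can be run on `df`; columns with very few distinct values are the likely categorical ones.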

## Inspect Single column

* `diet` has 10 unique values
* categorical column
* `Vegetarian` is the most frequent value
* 151 missing values

In [None]:
df.diet.nunique()

In [None]:
df.diet.unique()

In [None]:
df.diet.value_counts(dropna=False)

## Convert String to numeric

In [None]:
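# strip the ' M' suffix (literal match), treat missing values as 0, cast to int
df['cook_time'] = df['cook_time'].str.replace(' M', '', regex=False).fillna(0).astype(int)
df['prep_time'] = df['prep_time'].str.replace(' M', '', regex=False).fillna(0).astype(int)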
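The conversion above fails if a value does not follow the `15 M` pattern exactly. A more tolerant variant, sketched here on made-up values (the `times` series and the `'unknown'` entry are assumptions for illustration), extracts the digits and lets `pd.to_numeric` turn anything unparsable into `NaN` before filling.

```python
import pandas as pd

# Toy data standing in for a time column; values are invented.
times = pd.Series(['15 M', '120 M', None, 'unknown'])

# Extract the first run of digits; rows without digits become NaN.
digits = times.str.extract(r'(\d+)', expand=False)

# Coerce to numbers, fill missing with 0 (as in the cell above), cast to int.
minutes = pd.to_numeric(digits, errors='coerce').fillna(0).astype(int)
print(minutes.tolist())
```

Filling with 0 keeps the column as `int`, but it also hides missing values in later statistics; whether that trade-off is acceptable depends on the analysis.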

## Expand nested column

In [None]:
df['ingredients'].str.split('|', expand=True).head().dropna(axis=1)

### Find the most common ingredient

In [None]:
df['ingredients'].str.split('|', expand=True).melt()['value'].value_counts()
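An alternative to `expand=True` plus `melt()` is to split into lists and use `explode()`, which avoids building a wide intermediate frame. This is a sketch on invented recipes (the `toy` frame is an assumption); it assumes the same `|` separator as the `ingredients` column.

```python
import pandas as pd

# Toy recipes with invented ingredients, separated by '|'.
toy = pd.DataFrame({
    'ingredients': ['salt|rice|onion', 'salt|tomato', None, 'rice|salt'],
})

counts = (
    toy['ingredients']
    .str.split('|')   # each cell becomes a list of ingredients
    .explode()        # one row per ingredient
    .str.strip()      # drop stray whitespace around names
    .value_counts()   # missing values are excluded by default
)
print(counts.head(3))
```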

## Visualization

### missing values

* dataset has missing values in several columns
    * represented by darker green lines
* `cuisine`, `course` and `diet`
    * are missing for the last N records
    * the missing values of those columns follow the same pattern

In [None]:
df.isna()

In [None]:
sns.heatmap(df.isna(), cmap='Greens')

In [None]:
sns.heatmap(df.tail(3000).isna(), cmap='Greens')
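The heatmap shows where values are missing; a per-column count and share puts numbers on it. The sketch below uses a small invented frame (`toy` and its values are assumptions) and only `isna()`, `sum()` and `mean()`.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values and a few gaps.
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'diet': ['Vegan', np.nan, np.nan, 'Vegetarian'],
    'course': ['Lunch', 'Dinner', np.nan, np.nan],
})

# isna() gives booleans; sum() counts them, mean() gives the fraction.
missing = pd.DataFrame({
    'missing': toy.isna().sum(),
    'share': toy.isna().mean(),
})
print(missing.sort_values('missing', ascending=False))
```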

### correlation of two columns

* no correlation between numerical columns for the whole DataFrame
* for Vegan recipes there is correlation between `prep_time` and `cook_time`
* darker color represents higher (more positive) correlation
    * lighter - lower or negative
* to display the values use `annot=True`

In [None]:
sns.heatmap(df.corr(numeric_only=True), cmap='Greens', annot=False)

In [None]:
sns.heatmap(df[df.diet == 'Vegan'].corr(numeric_only=True), cmap='Greens', annot=True)
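To read a single coefficient rather than a whole matrix, `Series.corr` can be applied to two columns directly. The sketch below is built on invented numbers (the `toy` frame is an assumption, constructed so that one subset is correlated while the whole frame is not); it illustrates how a subset can show a relationship that the full data hides.

```python
import pandas as pd

# Invented values: within 'Vegan' the two times rise together,
# within 'Other' they move in opposite directions.
toy = pd.DataFrame({
    'diet': ['Vegan', 'Vegan', 'Vegan', 'Other', 'Other', 'Other'],
    'prep_time': [10, 20, 30, 10, 20, 30],
    'cook_time': [15, 25, 35, 30, 20, 10],
})

# Pearson correlation on the whole frame and on the Vegan subset.
r_all = toy['prep_time'].corr(toy['cook_time'])
vegan = toy[toy['diet'] == 'Vegan']
r_vegan = vegan['prep_time'].corr(vegan['cook_time'])

print(round(r_all, 3), round(r_vegan, 3))
```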

## Check for Outliers

* `vote_count` - suggests outliers
* `cook_time` - has possible outliers
* `rating` - doesn't indicate outliers

In [None]:
cols = ['vote_count', 'rating',  'prep_time', 'cook_time']
plt.figure(figsize=(20, 5))
sns.boxplot(data=df[cols], orient='h')

In [None]:
cols = ['prep_time', 'cook_time']
plt.figure(figsize=(20, 5))
sns.boxplot(data=df[cols].head(18), orient='h').set(title='Detecting Outliers Explained')

In [None]:
df[cols].head(18).describe()

In [None]:
df[cols].head(18)
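By default a seaborn box plot draws whiskers up to 1.5 times the interquartile range (IQR) beyond the quartiles and marks anything further out as a separate point. The same rule can be applied by hand, sketched here on invented cooking times (the `cook` series is an assumption, not data from the recipes file).

```python
import pandas as pd

# Invented cooking times in minutes, with one extreme value.
cook = pd.Series([10, 15, 20, 20, 25, 30, 35, 240])

# Quartiles and interquartile range.
q1 = cook.quantile(0.25)
q3 = cook.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = cook[(cook < lower) | (cook > upper)]
print(lower, upper, outliers.tolist())
```

A flagged value is only a candidate: a long cooking time may be a valid slow-cooked recipe rather than a data error, so it is worth inspecting the rows before dropping them.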