<h2 align=center>Exploratory Data Analysis With Python and Pandas</h2>
<img src="logo.png">

### Libraries

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
!pip install calmap
import calmap
!pip3 install pandas_profiling --upgrade
from pandas_profiling import ProfileReport

### Task 1: Initial Data Exploration

In [3]:
df=pd.read_csv("/content/sample_data/supermarket_sales - Sheet1.csv")

In [None]:
df.head(10)

In [None]:
df.dtypes


In [None]:
df['Date']=pd.to_datetime(df['Date'])

In [None]:
df.columns


In [8]:
df.set_index('Date', inplace=True)  # inplace=True modifies df directly instead of returning a new DataFrame

In [None]:
df.describe()

### Task 2: Univariate Analysis

**Question 1:** What does the distribution of customer ratings look like? Is it skewed?

In [None]:
sns.histplot(df['Rating'], kde=True)  # distplot is deprecated since seaborn 0.11; histplot with kde=True replaces it
plt.axvline(x=np.mean(df['Rating']), c='red', ls='--', label='mean')  # vertical line at the mean
plt.axvline(x=np.percentile(df['Rating'], 25), c='green', ls='--', label='25th-75th percentile')  # lower quartile
plt.axvline(x=np.percentile(df['Rating'], 75), c='green', ls='--')  # upper quartile
plt.ylabel("Frequency")
plt.legend()
plt.show()

In [None]:
df.hist(figsize=(10,20))
plt.show()

In [None]:
plt.axvline(x=np.mean(df['Rating']),c='red',ls='--')
plt.show()

**Conclusion**

The distribution of ratings does not appear to be skewed in either direction.
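
A numeric check can back up the visual impression. The sketch below uses `Series.skew()`, where values near 0 suggest a symmetric distribution and clearly positive values suggest a right skew. Both series are made-up stand-ins for illustration, not values from the dataset; on the real data the same call would be applied to `df['Rating']`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['Rating']: evenly spaced values between 4 and 10.
symmetric = pd.Series(np.linspace(4.0, 10.0, 61), name="Rating")

# Hypothetical right-skewed sample for contrast (one large value in the tail).
right_skewed = pd.Series([1, 1, 1, 2, 2, 3, 10])

sym_skew = symmetric.skew()       # sample skewness, close to 0 for symmetric data
right_skew = right_skewed.skew()  # positive when the right tail is longer

print(f"symmetric:    skew={sym_skew:.3f}, mean={symmetric.mean():.2f}, median={symmetric.median():.2f}")
print(f"right-skewed: skew={right_skew:.3f}, mean={right_skewed.mean():.2f}, median={right_skewed.median():.2f}")
```

A mean that sits close to the median, together with a skewness near 0, is consistent with the conclusion above.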


**Question 2:** Do aggregate sales numbers differ by much between branches?

In [None]:
sns.countplot(x=df['Branch'])
plt.title('Number of sales by branch')
plt.ylabel('Number of sales')
plt.show()
df['Branch'].value_counts()
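
The count plot shows how many transactions each branch recorded, not their monetary value. To compare aggregate amounts, one option is to group by branch and summarize a sales-amount column. This is a minimal sketch on toy data: the `Total` column name and the values are assumptions for illustration, so substitute whichever column holds the transaction amount.

```python
import pandas as pd

# Toy data; 'Total' is an assumed column name for the transaction amount.
sales = pd.DataFrame({
    "Branch": ["A", "B", "C", "A", "B", "C"],
    "Total": [100.0, 80.0, 120.0, 50.0, 70.0, 30.0],
})

# Number of transactions, summed amount, and average amount per branch.
branch_summary = sales.groupby("Branch")["Total"].agg(["count", "sum", "mean"])
print(branch_summary)
```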

***We are now done with univariate analysis, so let's move on to analyzing two variables at a time, i.e. looking for correlations. Scatter plots and heatmaps are well suited to interpreting correlations.***


### Task 3: Bivariate Analysis

**Question 3:** Is there a relationship between gross income and customer ratings?

In [None]:
sns.scatterplot(x=df['Rating'],y=df['gross income'])
sns.regplot(x=df['Rating'],y=df['gross income']) ## for trendline

A box plot is used to plot a categorical variable against a numerical variable.

In [None]:
sns.boxplot(x=df['Branch'],y=df['gross income'])

As the scatter plot and trend line for Question 3 show, there is, interestingly, no apparent correlation between gross income and customer ratings.

**Question 4:** Is there a noticeable time trend in gross income?

In [None]:
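daily_income = df.groupby(df.index)['gross income'].mean()  # mean gross income per date; selecting the column first avoids averaging text columns
sns.lineplot(x=daily_income.index, y=daily_income)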

### Task 4: Dealing With Duplicate Rows and Missing Values

In [17]:
df.duplicated().sum()  # duplicated() returns a boolean Series; summing it counts the duplicated rows
df[df.duplicated()]  # rows that repeat an earlier row
df.drop_duplicates(inplace=True)

In [None]:
df.isna().sum()  # number of missing (NA, "not available") values in each column
len(df)
sns.heatmap(df.isnull(), cbar=True)  # visualize where the missing values are
df.fillna(df.mean(numeric_only=True), inplace=True)  # fills numeric columns only, using each column's mean

For categorical values, the mode can be used to replace the missing values.

In [None]:
df.mode()
df.fillna(df.mode().iloc[0], inplace= True)
df.head()
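
The two fill strategies can also be combined in one pass that picks the strategy by column type. The following is a sketch under stated assumptions: `impute_by_dtype` is a hypothetical helper (not part of pandas), and the toy frame with its `Payment` column is made up for illustration.

```python
import numpy as np
import pandas as pd

def impute_by_dtype(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy where numeric NaNs are filled with
    the column mean and all other NaNs with the column mode."""
    out = frame.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:  # an all-NaN column has no mode
                out[col] = out[col].fillna(modes.iloc[0])
    return out

# Toy data with one missing numeric value and one missing categorical value.
toy = pd.DataFrame({
    "Rating": [6.0, np.nan, 8.0],
    "Payment": ["Cash", "Cash", None],
})
filled = impute_by_dtype(toy)
print(filled)
```

Returning a copy rather than filling in place keeps the original frame available for comparing results before and after imputation.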

**Pandas profiling streamlines the entire EDA process** 

In [None]:
dataset = pd.read_csv("/content/sample_data/supermarket_sales - Sheet1.csv")
profile = ProfileReport(dataset)  # build the profiling report from the raw data
profile

### Task 5: Correlation Analysis

In [None]:
np.corrcoef(df['gross income'],df['Rating'])


In [None]:
sns.heatmap(np.round(np.corrcoef(df['gross income'],df['Rating']),2), annot= True)
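
The same idea extends from one pair of variables to every numeric column at once. The sketch below is a minimal example on toy data: the `Quantity` column and all values are assumptions for illustration. It selects the numeric columns explicitly and computes the pairwise Pearson correlations with `DataFrame.corr()`.

```python
import pandas as pd

# Toy frame; the values and the 'Quantity' column are made up for illustration.
toy = pd.DataFrame({
    "Quantity": [1, 2, 3, 4, 5],
    "gross income": [2.0, 4.1, 5.9, 8.2, 10.0],
    "Rating": [7.0, 5.0, 9.0, 4.0, 8.0],
    "Branch": ["A", "B", "C", "A", "B"],
})

# Keep numeric columns only so text columns are never passed to corr().
corr = toy.select_dtypes(include="number").corr().round(2)
print(corr)
```

The resulting matrix can be passed to `sns.heatmap(corr, annot=True)` in the same way as the two-variable case above.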