# Questions
1. Analyse the Data Quality of Retail Datasets
2. For columns with missing values, find the count of missing values
3. Find the number of Respondents of a recent Email campaign
4. Provide the list of Respondents in an Excel format with Customer details
5. Transactions data takes too much space. Can you reduce the size for storing?

# Data Import and Export

In [73]:
#Import Libraries
import pandas as pd

## Analyse the Data Quality of Retail Datasets
**Concepts Covered:**
1. Reading comma separated files (.read_csv())
2. Gathering meta data from Dataframes (.info())
3. Extracting Summary Statistics from Dataframes (.describe())
4. Showing descriptive statistics for categorical variables (include = 'all')
5. (.shape) to find the number of rows and columns in a Dataframe 
6. Investigate datatypes in a dataframe using (.dtypes) and (.columns)
7. Convert Date columns to datetime format (.to_datetime())

### Reading comma separated files (.read_csv())

In [74]:
#We can import a CSV file by name, provided it's present in the same folder as this notebook
customers = pd.read_csv('Retail_Data_Customers.csv')

In [75]:
#Similarly we can import the transactions and summary data
transactions = pd.read_csv('Retail_Data_Transactions.csv')
summary = pd.read_csv('Retail_Data_Customers_Summary.csv')

### Gathering meta data from Dataframes (.info())

In [76]:
#Using info() will provide us with the column names and their data types
summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6889 entries, 0 to 6888
Data columns (total 13 columns):
customer_id           6889 non-null object
tran_amount_2011      6501 non-null float64
tran_amount_2012      6809 non-null float64
tran_amount_2013      6804 non-null float64
tran_amount_2014      6801 non-null float64
tran_amount_2015      4225 non-null float64
transactions_2011     6501 non-null float64
transactions_2012     6809 non-null float64
transactions_2013     6804 non-null float64
transactions_2014     6801 non-null float64
transactions_2015     4225 non-null float64
First_Transaction     6889 non-null object
Latest_Transaction    6889 non-null object
dtypes: float64(10), object(3)
memory usage: 699.8+ KB


### Extracting Summary Statistics from Dataframes (.describe())

In [77]:
#Using describe() will provide us with descriptive statistics of the dataframe
summary.describe()

Unnamed: 0,tran_amount_2011,tran_amount_2012,tran_amount_2013,tran_amount_2014,tran_amount_2015,transactions_2011,transactions_2012,transactions_2013,transactions_2014,transactions_2015
count,6501.0,6809.0,6804.0,6801.0,4225.0,6501.0,6809.0,6804.0,6801.0,4225.0
mean,206.174281,310.853136,314.134039,307.970593,103.0,3.177665,4.779116,4.835979,4.738421,1.578935
std,128.092642,175.768279,175.743597,170.769072,66.204169,1.697319,2.26504,2.277427,2.192086,0.840771
min,10.0,10.0,10.0,10.0,10.0,1.0,1.0,1.0,1.0,1.0
25%,104.0,175.0,175.0,175.0,56.0,2.0,3.0,3.0,3.0,1.0
50%,184.0,288.0,293.0,288.0,85.0,3.0,5.0,5.0,5.0,1.0
75%,282.0,426.0,430.0,418.0,136.0,4.0,6.0,6.0,6.0,2.0
max,849.0,1242.0,1317.0,1029.0,538.0,11.0,18.0,17.0,15.0,7.0


In [78]:
#Let's check it for customers dataframe
customers.describe()

Unnamed: 0,Age
count,6830.0
mean,38.879795
std,10.541763
min,18.0
25%,32.0
50%,37.0
75%,44.0
max,92.0


### Showing descriptive statistics for categorical variables (include = 'all')

In [79]:
#We can get the descriptive statistics for the categorical columns too
customers.describe(include = 'all')

Unnamed: 0,customer_id,Name,Geography,Gender,Age
count,6901,6901,6901,6901,6830.0
unique,6901,6901,3,2,
top,CS8589,Greece D,France,Male,
freq,1,1,3466,3739,
mean,,,,,38.879795
std,,,,,10.541763
min,,,,,18.0
25%,,,,,32.0
50%,,,,,37.0
75%,,,,,44.0


### (.shape) to find the number of rows and columns in a Dataframe

In [80]:
#shape returns (rows, columns); with one row per customer, the row count is the number of customers
customers.shape

(6901, 5)

In [81]:
#Transactions has one row per purchase, so it holds many records for each customer in the above table
transactions.shape

(125000, 3)

In [82]:
#Summary holds transactions at customer level; it has 12 fewer rows than customers (6901 vs 6889), so 12 customers appear to have made no purchase
summary.shape

(6889, 13)
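The gap between two row counts can also be checked directly by listing the IDs that are present in one table but not the other. A minimal sketch, assuming toy stand-ins for `customers` and `summary` (the IDs below are invented for illustration, not taken from the retail files):

```python
import pandas as pd

# Toy stand-ins for the real dataframes; the IDs are made up.
customers_demo = pd.DataFrame({"customer_id": ["CS1", "CS2", "CS3", "CS4"]})
summary_demo = pd.DataFrame({"customer_id": ["CS1", "CS3"]})

# Keep the customers whose ID never appears in the summary table
in_summary = customers_demo["customer_id"].isin(summary_demo["customer_id"])
no_purchase = customers_demo[~in_summary]

# Difference in row counts, and the IDs that account for it
row_gap = customers_demo.shape[0] - summary_demo.shape[0]
missing_ids = no_purchase["customer_id"].tolist()
print(row_gap, missing_ids)
```

If the row gap and the number of missing IDs disagree, one of the tables probably contains duplicate or unmatched IDs.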

### Investigate datatypes in a dataframe using (.dtypes) and (.columns)

In [83]:
#Let's check the transaction data
transactions.head(1)

Unnamed: 0,customer_id,trans_date,tran_amount
0,CS5295,11-Feb-13,35


In [84]:
#Instead of info(), we could use (.dtypes) to get the datatypes of all the columns
transactions.dtypes

customer_id    object
trans_date     object
tran_amount     int64
dtype: object

In [85]:
#Similarly we could get the list of column names using (.columns)
transactions.columns

Index(['customer_id', 'trans_date', 'tran_amount'], dtype='object')

### Convert Date columns to datetime format (.to_datetime())

In [86]:
#Let's convert trans_date column to datetime datatype
transactions['trans_date'] = pd.to_datetime(transactions['trans_date'])

In [87]:
#Let's check the datatypes now
transactions.dtypes

customer_id            object
trans_date     datetime64[ns]
tran_amount             int64
dtype: object
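Dates written like `11-Feb-13` can also be parsed with an explicit format, which avoids guessing and makes bad values visible. A minimal sketch, assuming a small hand-made series (the third value is deliberately invalid and is not from the dataset):

```python
import pandas as pd

# Two dates in the same day-month-year style as trans_date, plus one bad value
dates = pd.Series(["11-Feb-13", "15-Mar-15", "not a date"])

# %d = day of month, %b = abbreviated month name, %y = two-digit year.
# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%d-%b-%y", errors="coerce")

# Count the values that could not be parsed
n_bad = int(parsed.isna().sum())
print(parsed.dtype, n_bad)
```

Note that `%b` relies on English month abbreviations, so this assumes the default (English) locale.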

In [88]:
#Similarly we could do the same for date columns in summary dataframe
summary.head(1)

Unnamed: 0,customer_id,tran_amount_2011,tran_amount_2012,tran_amount_2013,tran_amount_2014,tran_amount_2015,transactions_2011,transactions_2012,transactions_2013,transactions_2014,transactions_2015,First_Transaction,Latest_Transaction
0,CS2945,153.0,516.0,173.0,1029.0,40.0,2.0,7.0,3.0,13.0,1.0,18-May-11,08-Mar-15


In [89]:
#Let's convert the First_Transaction and Latest_Transaction columns to datetime datatype
summary['First_Transaction'] = pd.to_datetime(summary['First_Transaction'])
summary['Latest_Transaction'] = pd.to_datetime(summary['Latest_Transaction'])

In [90]:
#Let's check the datatypes
summary.dtypes

customer_id                   object
tran_amount_2011             float64
tran_amount_2012             float64
tran_amount_2013             float64
tran_amount_2014             float64
tran_amount_2015             float64
transactions_2011            float64
transactions_2012            float64
transactions_2013            float64
transactions_2014            float64
transactions_2015            float64
First_Transaction     datetime64[ns]
Latest_Transaction    datetime64[ns]
dtype: object

## For columns with missing values, find the count of missing values
**Concepts Covered:**
1. Using (.isnull()) to find missing values
2. Using (.isnull() & .sum()) to find count of missing values

In [91]:
#Let's check for missing values in Retail Summary
summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6889 entries, 0 to 6888
Data columns (total 13 columns):
customer_id           6889 non-null object
tran_amount_2011      6501 non-null float64
tran_amount_2012      6809 non-null float64
tran_amount_2013      6804 non-null float64
tran_amount_2014      6801 non-null float64
tran_amount_2015      4225 non-null float64
transactions_2011     6501 non-null float64
transactions_2012     6809 non-null float64
transactions_2013     6804 non-null float64
transactions_2014     6801 non-null float64
transactions_2015     4225 non-null float64
First_Transaction     6889 non-null datetime64[ns]
Latest_Transaction    6889 non-null datetime64[ns]
dtypes: datetime64[ns](2), float64(10), object(1)
memory usage: 699.8+ KB


### Using (.isnull()) to find missing values

In [92]:
#Using (.isnull()) returns a True/False mask that marks each missing value
summary.isnull()

Unnamed: 0,customer_id,tran_amount_2011,tran_amount_2012,tran_amount_2013,tran_amount_2014,tran_amount_2015,transactions_2011,transactions_2012,transactions_2013,transactions_2014,transactions_2015,First_Transaction,Latest_Transaction
0,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,True,False,False,False,False,True,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,True,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6884,False,False,False,False,True,True,False,False,False,True,True,False,False
6885,False,False,False,False,True,True,False,False,False,True,True,False,False
6886,False,False,False,False,True,True,False,False,False,True,True,False,False
6887,False,False,False,False,True,True,False,False,False,True,True,False,False


### Using (.isnull() & .sum()) to find count of missing values

In [105]:
#We now get the count of True values per column using (.sum())
summary.isnull().sum()

customer_id              0
tran_amount_2011       388
tran_amount_2012        80
tran_amount_2013        85
tran_amount_2014        88
tran_amount_2015      2664
transactions_2011      388
transactions_2012       80
transactions_2013       85
transactions_2014       88
transactions_2015     2664
First_Transaction        0
Latest_Transaction       0
dtype: int64

In [94]:
#Similarly we could check it for customers
customers.isnull().sum()

customer_id     0
Name            0
Geography       0
Gender          0
Age            71
dtype: int64
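Since the question asks only about columns that have missing values, the counts can be filtered to drop the zero entries, and `.mean()` on the same mask gives the share of missing values. A minimal sketch, assuming a toy frame (column names mirror the summary data, the values are invented):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the summary dataframe; the values are made up.
demo = pd.DataFrame({
    "customer_id": ["CS1", "CS2", "CS3", "CS4"],
    "tran_amount_2011": [120.0, np.nan, 75.0, np.nan],
    "tran_amount_2012": [200.0, 310.0, np.nan, 95.0],
})

# Count of missing values per column
missing_counts = demo.isnull().sum()

# Keep only the columns that actually have missing values
only_missing = missing_counts[missing_counts > 0]

# True counts as 1 and False as 0, so the mean is the missing fraction
missing_pct = demo.isnull().mean() * 100
print(only_missing)
print(missing_pct.round(1))
```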

## Find the number of Respondents of a recent Email campaign
**Concepts Covered:**
1. Import other formats of data (json)

### Import JSON format data into Pandas

In [95]:
#Import Email response data which is in JSON format 
response = pd.read_json('Retail_Data_Response.json')

In [96]:
#Let's check the data
response

Unnamed: 0,customer_id,response
0,CS1112,0
1,CS1113,0
2,CS1114,1
3,CS1115,1
4,CS1116,1
...,...,...
6879,CS8996,0
6880,CS8997,0
6881,CS8998,0
6882,CS8999,0


In [103]:
#We will sum up the response column to get the number of respondents
response['response'].sum()

647
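Summing works here because the response column is coded as 0 and 1. The same column also gives the split between respondents and non-respondents and the response rate. A minimal sketch, assuming a toy response table with invented IDs:

```python
import pandas as pd

# Toy stand-in for the response dataframe; 1 = responded, 0 = did not.
response_demo = pd.DataFrame({
    "customer_id": ["CS1", "CS2", "CS3", "CS4", "CS5"],
    "response": [0, 0, 1, 1, 1],
})

# Number of respondents (sum of the 1s)
n_respondents = int(response_demo["response"].sum())

# Rows per response value
counts = response_demo["response"].value_counts()

# Mean of a 0/1 column is the fraction of respondents
rate = response_demo["response"].mean()
print(n_respondents, counts.to_dict(), rate)
```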

## Provide the list of Respondents in an Excel format with Customer details
**Concepts Covered:**
1. Export Single Dataframe to Excel (.to_excel())
2. Export Multiple Dataframes to Excel (.ExcelWriter())

### Export Single Dataframe to Excel (.to_excel())

In [98]:
#If we want to export the JSON data to Excel we could use the .to_excel() method
response.to_excel('Retail_Data_Single_Export.xlsx',sheet_name='Response')

In [99]:
#We could remove the Dataframe index, using the (index = False) argument
response.to_excel('Retail_Data_Single_Export.xlsx',sheet_name='Response', index = False)

### Export Multiple Dataframes to Excel (.ExcelWriter())

In [100]:
#Create a Pandas Excel Writer using (.ExcelWriter())
writer = pd.ExcelWriter('Retail_Data_Export.xlsx', engine='xlsxwriter')

In [101]:
#Export multiple dataframes using different sheet names
customers.to_excel(writer, sheet_name='Customers', index = False)
response.to_excel(writer, sheet_name='Response', index = False)

In [102]:
#Close the Excel Writer to write the Excel file (writer.save() was removed in pandas 2.0)
writer.close()
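The question asks for respondents together with their customer details, which suggests filtering the response table and joining it to the customer table before exporting. A minimal sketch, assuming toy versions of both tables (all names and values below are invented, and the output filename in the comment is hypothetical):

```python
import pandas as pd

# Toy stand-ins for the customers and response dataframes; rows are made up.
customers_demo = pd.DataFrame({
    "customer_id": ["CS1", "CS2", "CS3"],
    "Name": ["A", "B", "C"],
    "Geography": ["France", "Spain", "Germany"],
})
response_demo = pd.DataFrame({
    "customer_id": ["CS1", "CS2", "CS3"],
    "response": [0, 1, 1],
})

# Keep only the customers who responded
respondents = response_demo[response_demo["response"] == 1]

# Attach customer details; how="left" keeps every respondent
# even if no matching customer row exists
respondent_details = respondents.merge(customers_demo, on="customer_id", how="left")
print(respondent_details)

# Exporting would then be one call (hypothetical filename, needs an Excel engine):
# respondent_details.to_excel("Respondents.xlsx", sheet_name="Respondents", index=False)
```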

## Transactions data takes too much space. Can you reduce the size for storing?
**Concepts Covered:**
1. Export data with compression using (compression = 'gzip') argument
2. Import compressed(.gz) files

In [109]:
#While exporting we can compress and store data
transactions.to_csv('Retail_Data_Transactions_Compressed.gz', compression='gzip', index=False)

In [110]:
#We can import it using the (.read_csv()) method; the compression is inferred from the .gz extension
read_gz_transactions = pd.read_csv('Retail_Data_Transactions_Compressed.gz')

In [111]:
read_gz_transactions.head()

Unnamed: 0,customer_id,trans_date,tran_amount
0,CS5295,2013-02-11,35
1,CS4768,2015-03-15,39
2,CS2122,2013-02-26,52
3,CS1217,2011-11-16,99
4,CS1850,2013-11-20,78
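To see how much space compression saves, the file sizes of a plain and a gzip export can be compared. A minimal sketch, assuming a synthetic frame written to a temporary folder (the data is repetitive by construction, so it compresses far better than real transactions would):

```python
import os
import tempfile

import pandas as pd

# Synthetic stand-in for the transactions dataframe; every row is identical.
demo = pd.DataFrame({
    "customer_id": ["CS5295"] * 5000,
    "trans_date": pd.to_datetime(["2013-02-11"] * 5000),
    "tran_amount": [35] * 5000,
})

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "demo.csv")
    gz_path = os.path.join(tmp, "demo.csv.gz")

    # Write the same data with and without gzip compression
    demo.to_csv(plain_path, index=False)
    demo.to_csv(gz_path, compression="gzip", index=False)

    plain_size = os.path.getsize(plain_path)
    gz_size = os.path.getsize(gz_path)

    # CSV stores dates as text, so ask read_csv to parse them again
    restored = pd.read_csv(gz_path, parse_dates=["trans_date"])

print(plain_size, gz_size, restored.dtypes["trans_date"])
```

Without `parse_dates`, the date column is not restored as a datetime column after the round trip, so the conversion would have to be repeated.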


# END
**Pandas Concepts Covered:**
1. Reading data into Pandas Dataframes (.read_csv(), .read_json())
2. Investigating data (.info(), .describe(), dtypes, shape, columns)
3. Converting Date columns to date time format (.to_datetime())
4. Finding count of missing values in a column (.isnull() and .sum())
5. Exporting one or more dataframes (.to_csv(), .to_excel(), ExcelWriter())
6. Compressing data while exporting