### EDA

* A set of procedures for examining data
* Summarizing the data with the help of descriptive and graphical tools
* 'No assumption' study of data
* Determining relationships among variables
* Handling missing values

* **Primary Aim** - Performing quantitative and qualitiative evaluation of the data to draw meaningfule insights from it

### Steps involved in EDA

1. Collect & organize data
2. import data
3. Pre-process data
4. Explore & summarize data
5. Develop insights from data

### 1. Collect & organize data

1. Gather and understand the relevant data
2. Identify the type of each data
3. Identify the type of attributes in data

#### classify data 

1. Structured data
2. Semi-structured data
3. Unstructured data

#### Attributes in structured data

1. Quantitative data \
a. Discrete \
b. Continuous

2. Qualitative data \
a. Binary data \
b. nominal data \
c. Ordinal data

#### importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("libraries imported successfully")

libraries imported successfully


#### import structured data

In [2]:
user_rating = pd.read_csv("EDA_Master/Data/user_rating.csv")
user_rating.head(5)

Unnamed: 0,userID,placeID,food_rating,service_rating
0,U1077,135085,2,2
1,U1077,135038,2,1
2,U1077,132825,2,2
3,U1077,135060,2,2
4,U1068,135104,1,2


In [3]:
consumer_survey = pd.read_excel("EDA_Master/Data/consumer_survey.xlsx")
consumer_survey.head(5)

Unnamed: 0,gender,smoker,drink_level,dress_preference,ambience,transport,marital_status,interest,personality,religion,activity,income,FRVPM,AERPM
0,male,False,abstemious,informal,family,on foot,single,variety,thrifty-protector,none,student,medium,12,2976
1,female,False,abstemious,informal,family,public,single,technology,hunter-ostentatious,Catholic,student,low,12,3648
2,female,False,social drinker,formal,family,public,single,none,hard-worker,Catholic,student,low,3,1461
3,male,False,abstemious,informal,family,public,single,variety,hard-worker,none,professional,medium,18,4014
4,female,False,abstemious,no preference,family,public,single,none,thrifty-protector,Catholic,student,medium,15,3045


In [4]:
restaurant_parking = pd.read_excel("EDA_Master/Data/restaurant_parking.xlsx")
restaurant_parking.head(5)

Unnamed: 0,placeID,parking_lot
0,135111,public
1,135110,none
2,135109,none
3,135108,none
4,135107,none


In [5]:
restaurant_cuisine = pd.read_excel("EDA_Master/Data/restaurant_cuisine.xlsx")
restaurant_cuisine.head(5)

Unnamed: 0,placeID,Rcuisine
0,132001,Dutch-Belgian
1,132002,Seafood
2,132003,International
3,132004,Seafood
4,132005,French


### Python supports for SQL

SQLite is a lightweight database and it is shipped by default along with python version 2.5.x onwards.

Installing sqlite \
pip install db-sqlite3

In [1]:
#import sqlite3 library

import sqlite3

#create a connection object that can be used to connect to a database 'demo_database'

conn = sqlite3.connect('demo_database.db')

In [2]:
import pyodbc #python open database connectivity - used to connect any database

#### import ms access data

In [3]:
[i for i in pyodbc.drivers() if i.startswith('Microsoft Access Driver')]
['Microsoft Access Driver (*.mdb, *.accdb)']

['Microsoft Access Driver (*.mdb, *.accdb)']

In [4]:
conn_str = (r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=EDA_Master/Data/restaurant_cuisine.accdb;')
conn = pyodbc.connect(conn_str)
cuisine = pd.read_sql('select * from cuisine', conn)
conn.close()
cuisine.head(5)

InterfaceError: ('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')

importing data from oracle database

In [5]:
import cx_Oracle #enables access to oracle database
host = 'localhost'
port = 1521
SID = 'xe'
dsn_tns = cx_Oracle.makedsn(host,port,SID)
conn = cx_Oracle.connect('DBUser1','data1',dsn_tns)
restaurant_details = pd.read_sql("select * from resturant_details",conn)
conn.close()
restaurant_details.head(5)

DatabaseError: DPI-1047: Cannot locate a 64-bit Oracle Client library: "The specified module could not be found". See https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html for help

#### XML data
XML - eXtensible Markup Language - It is made up of user defines tags which helps in organizing and identifying the data.

In [10]:
import xml.etree.ElementTree as ET

#### importing semi structured data

In [11]:
tree = ET.parse('EDA_Master/Data/reviews.xml')
root = tree.getroot()
print(root.tag, "Id:",root.attrib['id'])

Restaurant Id: 41955207


In [12]:
restaurantDict = {}
reviewDict = {}
specialFeatureList = []
reviewList = []

for elem in root:
    restaurantDict.update({elem.tag:elem.text})
    for subelem in elem:
        if elem.tag == "SpecialFeatures":
            specialFeatureList.append(subelem.text)
            restaurantDict.update({"SpecialFeatures":specialFeatureList})
        if elem.tag == "Reviews":
            reviewDict = {}
            reviewDict.update({"ReviewID":subelem.attrib['id']})
            for subsubelem in subelem:
                reviewDict.update({subsubelem.tag:subsubelem.text})
            reviewList.append(reviewDict)
            restaurantDict.update({"reviews":reviewList})


In [13]:
print(root.tag,"Id:",root.attrib['id'])
print("-----------------------------------------------------------------------------")

for key,val in restaurantDict.items():
    if key=="Reviews":
        print(key,":")
        for reviews in val:
            for keyR, valR in reviews.items():
                print(keyR,":",valR)
            print("\n-----------------------------------------------------")
    else:
        print(key,":",val)

Restaurant Id: 41955207
-----------------------------------------------------------------------------
Name : Staghorn Steakhouse
ZipCode : 10018
Cuisines : Seafood, Steakhouse
PriceLevel : None
Hours : None
Payment : None
DressCode : None
SpecialFeatures : ['Notable Wine List', 'Business Dining', 'Romantic Dining', 'Group Dining', 'Fine Dining']
PromptSeating : yes
MakeReservation : yes
Romantic : yes
GoodForKids : no
GoodForGroups : yes
Reviews :


AttributeError: 'str' object has no attribute 'items'

#### importing unstructured data

In [14]:
feedback = pd.read_table("EDA_Master/Data/feedback.txt",names=["Reviews"])
feedback.head()

Unnamed: 0,Reviews
0,The food for our event was delicious .
1,The food in the lounge was great and very fre...
2,"As far as food, walk a few blocks toward Mich..."
3,The Palm resturant in the hotel had some spec...
4,Took the charge of the minibar which we had u...


## Pre-process data

* Data cleaning
* Merging multiple dataframes
* Sub-setting data frames based on a condition
* Ordering data in ascending/descending order
* Reshaping dataframe

### Clean data

#### Check for inconsistencies in data

In [1]:
# restaurant_details["STATE"].value_counts()

In [None]:
# To replace inconsistency

# restaurant_details["STATE"] = ["Mexico" if x in [" mexico ", " Mexico "] else x for x in restaurant_details["STATE"]]

#### Check for negative values in data

In [3]:
# to check negative values

# print(len(consumer_survey.loc[consumer_survey.FRVPM<0,:]))

# to remove negative values

# replace it by 0

# res = [0 if i < 0 for i in data]

#### Check for missing values

In [4]:
#df.isna() -- cell wise na values present or not

#df.isna().any() -- column wise na values present or not

#df.isna().any(axis=1) --- row wise na values present or not

#df[colname].isna().any() -- for any column

#df.colname.mean(skipna=true) -- to get mean value by skipping na values

#df.dropna() --- to drop na values

#### Delete data

#### Delete rows

In [6]:
#df.replace('value',np.NaN, inplace=True) #To replace other values

#df.dropna(axis=0, inplace=True, thresh=13) #To drop rows with na values

#df.reset_index(inplace=True,drop=True) #To reset index

In [7]:
#replace multiple missing values

# missing_values = ['?',"' ',' NA ']
# for i in missing_values:
#       restataurant_details.replace(i, np.NaN, inplace=True)

#### Delete columns

In [8]:
#df.dropna(axis=1,inplace=True,thresh=60*1.1) -- 60% of 110

#### Impute data

In [9]:
#import package

#from sklearn.impute import SimpleImputer

In [10]:
### imputing data with constant value

#to get the index number where the data is missing
## df[df.isna().any(axis=1)].index.tolist()

# create an array of column
## arr = np.array(df["colname"])
## imputer = SimpleImputer(strategy="constant",fill_value="Mexico")
## df["colname"] = imputer.fit_transform(arr.reshape(-1,1))

In [11]:
### imputing data with mode value

# find the mode
## df["colname"].mode()[0]

# create an array of column
## arr = np.array(df["colname"])
## imputer = SimpleImputer(strategy="most_frequent")
## df["colname"] = imputer.fit_transform(arr.reshape(-1,1))

In [12]:
### imputing data with mean value

# find the mode
## df["colname"].mean()

# create an array of column
## arr = np.array(df["colname"])
## imputer = SimpleImputer(strategy="mean")
## df["colname"] = imputer.fit_transform(arr.reshape(-1,1))

In [14]:
### imputing data with mean value by groupby function

# calculate group means
## df[["colname1","colname2"]].groupby("colname2").mean()

## grouped_df = df[["colname1","colname2"]].groupby('colname2')
## mean_updated = grouped_df.transform(lambda x:x.fillna(x.mean()))
## mean_upddated["colname2"] = df["colname2"]

In [15]:
### imputing data with median value

# find the median
## df["colname"].median()[0]

# create an array of column
## arr = np.array(df["colname"])
## imputer = SimpleImputer(strategy="medin")
## df["colname"] = imputer.fit_transform(arr.reshape(-1,1))

In [16]:
### imputing data with median value by groupby function

# calculate group means
## df[["colname1","colname2"]].groupby("colname2").median()

## grouped_df = df[["colname1","colname2"]].groupby('colname2')
## median_updated = grouped_df.transform(lambda x:x.fillna(x.median()))
## median_upddated["colname2"] = df["colname2"]

#### Merge data

In [17]:
# df_merge = pd.merge(df1,df2.set_index('indexcolname'),left_on='indexcolname',right_index=True,how="inner")

#### subsetting data

In [18]:
# df.loc[:,['col1','col2','col3']].head()

#### ordering dataframe

In [19]:
## df.sort_values(by = ['colname'],ascending=True).head()

#### Reshape data

In [20]:
## Adding a feature to df

# df['newcolname'] = df['colname'] != 'condition'

#### melting a df

In [21]:
## df_melt = pd.melt(df, id_vars = ['col1','col2'], value_vars= ['col3','col4'], var_name = 'name1', value_name='name2')

#### casting a df

In [23]:
## pd.pivot_table(df_melt, index = 'indexcolname', columns = 'name1', values = 'col4', aggfunc = 'mean').head() --- col4 is the column which is to be aggregagted

### Explore and summarize data

#### Five number summary

In [24]:
# df.describe()

#### Box plot

#### Histogram

#### cross tab

In [26]:
#pd.crosstab(index=consumer_survey['gender'],columns='count',colnames=' ')

#### bar chart

### Developing insights from data

#### Plot and compare multiple box plots

#### Plot and compare multiple histograms

### Exploring relationship among qualitative variables

#### Pivot table
df.pivot_table(index ='...',values='...',columns='....',aggfunc='...')

#### Scatter plot

#### Simpson's paradox

It is a phenomena in which groups of data exhibit a particular trend, but this trend disappears or reverses, when the groups are combined together.


In [3]:
user_rating.describe()

Unnamed: 0,placeID,food_rating,service_rating
count,1161.0,1161.0,1161.0
mean,134192.041344,1.215332,1.090439
std,1100.916275,0.792294,0.790844
min,132560.0,0.0,0.0
25%,132856.0,1.0,0.0
50%,135030.0,1.0,1.0
75%,135059.0,2.0,2.0
max,135109.0,2.0,2.0
