In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond formal modeling or hypothesis testing. John Tukey promoted exploratory data analysis to encourage statisticians to explore their data and possibly formulate hypotheses that could lead to new data collection and experiments. EDA differs from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, handling missing values, and transforming variables as needed. EDA encompasses IDA.

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('mba.csv')

In [4]:
type(df)

pandas.core.frame.DataFrame

In [5]:
df.dtypes

shrimp                object
almonds               object
avocado               object
vegetables mix        object
green grapes          object
whole weat flour      object
yams                  object
cottage cheese        object
energy drink          object
tomato juice          object
low fat yogurt        object
green tea             object
honey                 object
salad                 object
mineral water         object
salmon                object
antioxydant juice     object
frozen smoothie       object
spinach               object
olive oil            float64
dtype: object
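
The column names listed above (shrimp, almonds, avocado, ...) look like the items of a single basket, which suggests the file may have no header row and that pandas used the first transaction as column names. If that is the case, passing `header=None` to `pd.read_csv` keeps that row as data. The sketch below uses a small inline sample (`csv_text`, a hypothetical stand-in for `mba.csv`) to show the difference.

```python
import io

import pandas as pd

# Hypothetical three-column sample standing in for mba.csv (contents assumed).
csv_text = (
    "shrimp,almonds,avocado\n"
    "burgers,meatballs,eggs\n"
    "chutney,,\n"
    "turkey,avocado,\n"
)

# Default behaviour: the first line is consumed as the header.
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None keeps the first line as data and numbers the columns 0, 1, 2.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(len(with_header), len(no_header))
```

With `header=None` the frame gains one row, so the counts reported later in this notebook would shift accordingly.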

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 20 columns):
shrimp               7500 non-null object
almonds              5746 non-null object
avocado              4388 non-null object
vegetables mix       3344 non-null object
green grapes         2528 non-null object
whole weat flour     1863 non-null object
yams                 1368 non-null object
cottage cheese       980 non-null object
energy drink         653 non-null object
tomato juice         394 non-null object
low fat yogurt       255 non-null object
green tea            153 non-null object
honey                86 non-null object
salad                46 non-null object
mineral water        24 non-null object
salmon               7 non-null object
antioxydant juice    3 non-null object
frozen smoothie      3 non-null object
spinach              2 non-null object
olive oil            0 non-null float64
dtypes: float64(1), object(19)
memory usage: 1.1+ MB


In [7]:
print(df.info())  # info() prints its report itself and returns None, hence the trailing "None"

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 20 columns):
shrimp               7500 non-null object
almonds              5746 non-null object
avocado              4388 non-null object
vegetables mix       3344 non-null object
green grapes         2528 non-null object
whole weat flour     1863 non-null object
yams                 1368 non-null object
cottage cheese       980 non-null object
energy drink         653 non-null object
tomato juice         394 non-null object
low fat yogurt       255 non-null object
green tea            153 non-null object
honey                86 non-null object
salad                46 non-null object
mineral water        24 non-null object
salmon               7 non-null object
antioxydant juice    3 non-null object
frozen smoothie      3 non-null object
spinach              2 non-null object
olive oil            0 non-null float64
dtypes: float64(1), object(19)
memory usage: 1.1+ MB
None


In [8]:
df.head()

Unnamed: 0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
0,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
1,chutney,,,,,,,,,,,,,,,,,,,
2,turkey,avocado,,,,,,,,,,,,,,,,,,
3,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
4,low fat yogurt,,,,,,,,,,,,,,,,,,,


In [9]:
print(df.head())

           shrimp    almonds     avocado    vegetables mix green grapes  \
0         burgers  meatballs        eggs               NaN          NaN   
1         chutney        NaN         NaN               NaN          NaN   
2          turkey    avocado         NaN               NaN          NaN   
3   mineral water       milk  energy bar  whole wheat rice    green tea   
4  low fat yogurt        NaN         NaN               NaN          NaN   

  whole weat flour yams cottage cheese energy drink tomato juice  \
0              NaN  NaN            NaN          NaN          NaN   
1              NaN  NaN            NaN          NaN          NaN   
2              NaN  NaN            NaN          NaN          NaN   
3              NaN  NaN            NaN          NaN          NaN   
4              NaN  NaN            NaN          NaN          NaN   

  low fat yogurt green tea honey salad mineral water salmon antioxydant juice  \
0            NaN       NaN   NaN   NaN           NaN    NaN

In [10]:
df.tail()

Unnamed: 0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
7495,butter,light mayo,fresh bread,,,,,,,,,,,,,,,,,
7496,burgers,frozen vegetables,eggs,french fries,magazines,green tea,,,,,,,,,,,,,,
7497,chicken,,,,,,,,,,,,,,,,,,,
7498,escalope,green tea,,,,,,,,,,,,,,,,,,
7499,eggs,frozen smoothie,yogurt cake,low fat yogurt,,,,,,,,,,,,,,,,


In [11]:
df.tail(3)

Unnamed: 0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
7497,chicken,,,,,,,,,,,,,,,,,,,
7498,escalope,green tea,,,,,,,,,,,,,,,,,,
7499,eggs,frozen smoothie,yogurt cake,low fat yogurt,,,,,,,,,,,,,,,,


In [12]:
df.sample  # without parentheses this shows the bound method, not a random sample

<bound method NDFrame.sample of                  shrimp            almonds            avocado  \
0               burgers          meatballs               eggs   
1               chutney                NaN                NaN   
2                turkey            avocado                NaN   
3         mineral water               milk         energy bar   
4        low fat yogurt                NaN                NaN   
5     whole wheat pasta       french fries                NaN   
6                  soup        light cream            shallot   
7     frozen vegetables          spaghetti          green tea   
8          french fries                NaN                NaN   
9                  eggs           pet food                NaN   
10              cookies                NaN                NaN   
11               turkey            burgers      mineral water   
12            spaghetti          champagne            cookies   
13        mineral water             salmon                

In [13]:
df.sample(10)

Unnamed: 0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
4878,mineral water,,,,,,,,,,,,,,,,,,,
545,turkey,herb & pepper,chocolate,,,,,,,,,,,,,,,,,
5300,burgers,shrimp,mineral water,cake,chocolate,french fries,escalope,cookies,,,,,,,,,,,,
1586,soup,,,,,,,,,,,,,,,,,,,
5343,muffins,,,,,,,,,,,,,,,,,,,
5195,antioxydant juice,chocolate,french fries,shallot,,,,,,,,,,,,,,,,
239,meatballs,milk,,,,,,,,,,,,,,,,,,
7306,tomatoes,whole wheat pasta,ground beef,mineral water,soup,olive oil,black tea,salmon,cake,,,,,,,,,,,
4082,burgers,eggs,barbecue sauce,french fries,,,,,,,,,,,,,,,,
1173,chocolate,mineral water,,,,,,,,,,,,,,,,,,
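
`df.sample(10)` returns different rows on every run. When a reproducible draw is wanted, `sample` accepts a `random_state` seed. A minimal sketch on a small stand-in frame (the `baskets` frame and its values are illustrative, not taken from `mba.csv`):

```python
import pandas as pd

# Small stand-in frame (values are illustrative, not taken from mba.csv).
baskets = pd.DataFrame(
    {"first_item": ["burgers", "chutney", "turkey", "soup", "eggs"]}
)

# The same random_state yields the same rows on every call.
draw_a = baskets.sample(n=3, random_state=42)
draw_b = baskets.sample(n=3, random_state=42)

print(draw_a.equals(draw_b))
```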


In [14]:
df.index

RangeIndex(start=0, stop=7500, step=1)

In [15]:
df.describe()

Unnamed: 0,olive oil
count,0.0
mean,
std,
min,
25%,
50%,
75%,
max,


In [16]:
print(df.describe())

       olive oil
count        0.0
mean         NaN
std          NaN
min          NaN
25%          NaN
50%          NaN
75%          NaN
max          NaN


In [17]:
df.describe().round(3)

Unnamed: 0,olive oil
count,0.0
mean,
std,
min,
25%,
50%,
75%,
max,
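
By default `describe()` summarizes only numeric columns, which is why only the empty `olive oil` column appears above. Passing `include="all"` also summarizes the text columns (count, number of unique values, most frequent value and its frequency). A sketch on a toy frame, where `toy`, `item_1` and `item_2` are hypothetical names:

```python
import pandas as pd

# Toy frame of text columns; None marks an empty basket slot.
toy = pd.DataFrame(
    {
        "item_1": ["burgers", "chutney", "burgers", "turkey"],
        "item_2": ["eggs", None, "milk", "eggs"],
    }
)

# include="all" adds count / unique / top / freq rows for text columns.
summary = toy.describe(include="all")
print(summary)
```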


In [18]:
df.corr()

Unnamed: 0,olive oil
olive oil,


In [19]:
df.columns

Index(['shrimp', 'almonds', 'avocado', 'vegetables mix', 'green grapes',
       'whole weat flour', 'yams', 'cottage cheese', 'energy drink',
       'tomato juice', 'low fat yogurt', 'green tea', 'honey', 'salad',
       'mineral water', 'salmon', 'antioxydant juice', 'frozen smoothie',
       'spinach', 'olive oil'],
      dtype='object')
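
Judging from the `head()` and `tail()` output, each row appears to be one transaction and each column a position within the basket rather than a variable of its own. Under that assumption, a per-item frequency count is more informative than per-column statistics. One possible sketch, using a toy frame with hypothetical column names, reshapes all cells into a single column with `melt` and counts the values:

```python
import pandas as pd

# Toy baskets: each row is a transaction, each column a slot in the basket.
toy = pd.DataFrame(
    {
        "item_1": ["burgers", "chutney", "turkey", "burgers"],
        "item_2": ["eggs", None, "avocado", "eggs"],
        "item_3": ["milk", None, None, "turkey"],
    }
)

# Stack every cell into one long column, drop the empty slots, then count.
item_counts = toy.melt(value_name="item")["item"].dropna().value_counts()
print(item_counts)
```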

In [20]:
# Check each column for missing data (isnull() and isna() are aliases, so one call suffices)
for col in df.columns:
    print(col + ' missing values:' + str(df[col].isnull().sum()))

shrimp missing values:0
almonds missing values:1754
avocado missing values:3112
vegetables mix missing values:4156
green grapes missing values:4972
whole weat flour missing values:5637
yams missing values:6132
cottage cheese missing values:6520
energy drink missing values:6847
tomato juice missing values:7106
low fat yogurt missing values:7245
green tea missing values:7347
honey missing values:7414
salad missing values:7454
mineral water missing values:7476
salmon missing values:7493
antioxydant juice missing values:7497
frozen smoothie missing values:7497
spinach missing values:7498
olive oil missing values:7500


In [21]:
df.isnull().sum()

shrimp                  0
almonds              1754
avocado              3112
vegetables mix       4156
green grapes         4972
whole weat flour     5637
yams                 6132
cottage cheese       6520
energy drink         6847
tomato juice         7106
low fat yogurt       7245
green tea            7347
honey                7414
salad                7454
mineral water        7476
salmon               7493
antioxydant juice    7497
frozen smoothie      7497
spinach              7498
olive oil            7500
dtype: int64
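
Raw missing counts are easier to compare as fractions, and counting the non-missing cells per row gives the number of items in each basket. A minimal sketch on a toy frame (names and values are illustrative assumptions):

```python
import pandas as pd

# Toy baskets; None marks an empty slot.
toy = pd.DataFrame(
    {
        "item_1": ["burgers", "chutney", "turkey", "burgers"],
        "item_2": ["eggs", None, "avocado", "eggs"],
        "item_3": ["milk", None, None, "turkey"],
    }
)

# Fraction of missing cells per column (mean of the boolean mask).
missing_share = toy.isnull().mean()

# Number of items per transaction (non-missing cells in each row).
basket_size = toy.notna().sum(axis=1)

print(missing_share)
print(basket_size)
```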

## Saman Siadati - Feb 2020