# Exploratory Data Analysis with Pandas

#### Resources:

##### YouTube
- https://www.youtube.com/watch?v=vmEHCJofslg
- https://www.youtube.com/playlist?list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS

##### Cheat sheets:
- https://www.enthought.com/wp-content/uploads/Enthought-Python-Pandas-Cheat-Sheets-1-8-v1.0.2.pdf

##### Towards Data Science
- https://towardsdatascience.com/how-to-master-pandas-for-data-science-b8ab0a9b1042

##### The documentation:
- https://pandas.pydata.org/docs/index.html

__Series:__ A pandas Series is a one-dimensional array, similar to a NumPy array, whose values carry labels (the axis labels, also called the index). A Series can hold integers, floats, strings, Python objects, etc. At a high level, a Series can be thought of as a column in Microsoft Excel.

__DataFrame:__ A pandas DataFrame is a two-dimensional labeled array. At a high level, a DataFrame can be thought of as a spreadsheet in Microsoft Excel (i.e. an M × N matrix where M denotes the rows and N the columns).
(from https://towardsdatascience.com/how-to-master-pandas-for-data-science-b8ab0a9b1042)
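
To make the two definitions concrete, here is a minimal sketch that builds a labeled Series and a small DataFrame. The labels `"a"`, `"b"`, `"c"` and the variable names `numbers`, `table` and `column` are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical values and labels, used only for illustration.
numbers = pd.Series([3, 6, 1], index=["a", "b", "c"], name="number")

# A DataFrame with the same row labels and two columns.
table = pd.DataFrame(
    {"number": [3, 6, 1], "name": ["Tom", "Jane", "Lisa"]},
    index=["a", "b", "c"],
)

# Selecting one column of a DataFrame returns a Series with the same index.
column = table["number"]

print(type(column).__name__)  # class name of the selected column
print(numbers["b"])           # value stored under the label "b"
print(table.shape)            # (number of rows, number of columns)
```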

In [None]:
import pandas as pd

### Create a DataFrame

In [None]:
df = pd.DataFrame({"Number":[3,6,1], "Name":["Tom",  "Jane", "Lisa"]})
df

In [None]:
df = pd.DataFrame([[3, "Tom"], [6, "Jane"], [1, "Lisa"]], columns=["Number", "Name"])
df

In [None]:
df.columns

In [None]:
df.columns = ["number", "name"]
df

In [None]:
df.index

In [None]:
df.index=["zero", "one", "two"]

In [None]:
df

### Appending rows and columns with concat()

In [None]:
df1 = pd.DataFrame(data={"number":8, "name":"John"}, index=["three"])
df1

In [None]:
df2 = pd.concat([df, df1])
df2

In [None]:
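# Reuse df's index: concat(axis=1) aligns on index labels, so a default 0..2 index would yield six half-empty rows
animal = pd.Series(["dog", "cat", "mouse"], name="animal", index=df.index)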
animal

In [None]:
df3 = pd.concat([df, animal], axis=1)
df3

### Accessing rows and columns

__loc__ gets rows (and/or columns) with particular labels.

__iloc__ gets rows (and/or columns) at integer locations.

In [None]:
df3["name"]  # the column was renamed to "name" above
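
The difference between label-based and position-based access is easiest to see side by side. The sketch below rebuilds a small frame like the one above; the name `people` and the other variable names are assumptions for illustration.

```python
import pandas as pd

# Small illustrative frame with string row labels.
people = pd.DataFrame(
    {"number": [3, 6, 1], "name": ["Tom", "Jane", "Lisa"]},
    index=["zero", "one", "two"],
)

row_by_label = people.loc["one"]            # row whose label is "one"
row_by_position = people.iloc[1]            # second row, counting from 0

cell_by_label = people.loc["two", "name"]   # row label, column label
cell_by_position = people.iloc[2, 1]        # row position, column position

# Label slices include the end label; position slices exclude the end position.
label_slice = people.loc["zero":"one"]
position_slice = people.iloc[0:1]
```

Both access styles return the same data here; they differ only in how the rows and columns are addressed.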

### Load a DataFrame from .csv

In [None]:
df = pd.read_csv("starbucks_drinkMenu_expanded.csv")

### First impressions

In [None]:
df.head(n=2)

In [None]:
df.sample(2)

In [None]:
df.columns

### Identify missing values

In [None]:
df.isna()

In [None]:
df2 = df.dropna()
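
`df.isna()` returns a full table of booleans, which is hard to read for a wide dataset. A per-column count is often more useful as a first check. The following is a minimal sketch on made-up data; the frame `drinks` and its values are hypothetical and do not come from the Starbucks file.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries.
drinks = pd.DataFrame(
    {
        "Beverage": ["A", "B", "C"],
        "Caffeine_mg": [75.0, np.nan, 150.0],
        "Calories": [3, 70, np.nan],
    }
)

# Number of missing values in each column.
missing_per_column = drinks.isna().sum()

# Drop every row that has at least one missing value.
complete_rows = drinks.dropna()

# Drop only the rows where a specific column is missing.
has_caffeine = drinks.dropna(subset=["Caffeine_mg"])

print(missing_per_column)
```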

### Meaningful names are important

In [None]:
df.columns

In [None]:
df = df.rename(columns={" Sugars (g)": "sugars_g"})
df.columns

In [None]:
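# regex=False treats "(" and ")" as literal characters; older pandas versions default to regex matching, where a bare "(" is an error
df.columns = df.columns.str.replace(" ", "", regex=False)
df.columns = df.columns.str.replace("(", "_", regex=False)
df.columns = df.columns.str.replace(")", "", regex=False)
df.columns = df.columns.str.replace("%", "perc_", regex=False)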
df.columns

### Select and Filter

In [None]:
df.Beverage_category.unique()

In [None]:
df_coffee = df.loc[df.Beverage_category=="Coffee"].copy()  # copy() avoids a SettingWithCopyWarning when df_coffee is modified later
df_coffee.head()

In [None]:
df_coffee = df[df.Beverage_category=="Coffee"].copy()  # same filter with plain boolean indexing

In [None]:
df_coffee = df.query("Beverage_category=='Coffee'").copy()  # same filter written as a query string

### Converting datatypes

In [None]:
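# Store the result in a new column so the name matches the unit and re-running the cell does not rescale again
df_coffee["Protein_mg"] = df_coffee.Protein_g * 1000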
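
Changing a column's datatype, as opposed to rescaling its values, could look like the sketch below. It uses a made-up frame `raw` with text entries; the values are hypothetical and only imitate the kind of mixed content a CSV column can contain.

```python
import pandas as pd

# Hypothetical column read as text because one entry is not a number.
raw = pd.DataFrame({"Caffeine_mg": ["75", "varies", "150"], "Calories": ["3", "70", "110"]})

# astype works when every entry can be converted.
raw["Calories"] = raw["Calories"].astype(int)

# to_numeric with errors="coerce" turns unparseable entries into NaN instead of raising.
raw["Caffeine_mg"] = pd.to_numeric(raw["Caffeine_mg"], errors="coerce")

print(raw.dtypes)
```

Using `errors="coerce"` keeps the rows and marks the bad entries as missing, so they can be handled with `isna()` or `dropna()` afterwards.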

### Sorting

In [None]:
df_coffee.sort_values("Beverage_prep", ascending=False)

In [None]:
df.sort_values(["Beverage_prep", "Caffeine_mg"], ascending=False).head()

### Recode the size to numeric and sort by size
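
One possible approach is to map each size label to a number and sort by that number. The sketch below runs on a made-up frame; the drink names, the `size_order` mapping and the `size_rank` column name are assumptions for illustration. In the real data, `Beverage_prep` may also contain values that are not sizes, and those would map to NaN.

```python
import pandas as pd

# Hypothetical rows; only the size labels matter here.
sizes = pd.DataFrame(
    {
        "Beverage": ["Drink 1", "Drink 2", "Drink 3", "Drink 4"],
        "Beverage_prep": ["Venti", "Short", "Grande", "Tall"],
    }
)

# Assumed ordering of the size labels, smallest first.
size_order = {"Short": 0, "Tall": 1, "Grande": 2, "Venti": 3}

# map() looks up each label in the dictionary; unknown labels become NaN.
sizes["size_rank"] = sizes["Beverage_prep"].map(size_order)

sizes_sorted = sizes.sort_values("size_rank")
print(sizes_sorted)
```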

### Grouping

In [None]:
df.groupby("Beverage_category")["Beverage"].count()

In [None]:
grouped = df.groupby("Beverage_category")  # a GroupBy object, not a DataFrame
grouped.groups
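
Beyond counting, a grouped object can compute several summaries at once. The following sketch uses named aggregation on a made-up frame; `menu`, its values and the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical menu with two categories.
menu = pd.DataFrame(
    {
        "Beverage_category": ["Coffee", "Coffee", "Smoothies"],
        "Calories": [3, 5, 280],
        "Caffeine_mg": [175.0, 260.0, 0.0],
    }
)

# Named aggregation: output_column=(input_column, function).
summary = menu.groupby("Beverage_category").agg(
    n_drinks=("Calories", "size"),
    mean_calories=("Calories", "mean"),
    max_caffeine=("Caffeine_mg", "max"),
)

print(summary)
```

The result has one row per category and one column per requested summary.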

### Pivot

In [None]:
df_smoothies = df[df.Beverage_category=="Smoothies"]

In [None]:
df_smoothies_mini = df_smoothies[["Beverage", "Beverage_prep", "TotalFat_g"]].copy()

In [None]:
df_smoothies_mini.shape

In [None]:
df_smoothies_mini = df_smoothies_mini.pivot(index="Beverage", columns="Beverage_prep", values=["TotalFat_g"])
df_smoothies_mini.head()

### Plotting with pandas

In [None]:
df.Caffeine_mg.unique()

In [None]:
df1 = df[~df.Caffeine_mg.isin(["0", "varies", "Varies"])].copy()
df1 = df1.dropna(subset=["Caffeine_mg"])  # filter on df1 itself; a mask built from df no longer matches df1's index
df1["Caffeine_mg"] = df1["Caffeine_mg"].astype(float)

### A basic histogram

In [None]:
df1.Caffeine_mg.plot.hist()

### A basic scatter plot

In [None]:
df1.plot.scatter(x="sugars_g", y="Calories", alpha=0.4, edgecolor="black", s=60)

### A basic bar plot

In [None]:
df2 = df.loc[df.Beverage_category=="Coffee"].copy()  # copy() avoids a SettingWithCopyWarning on the assignment below
df2["Caffeine_mg"] = df2["Caffeine_mg"].astype(float)

In [None]:
df2.plot.bar(x="Beverage", y="Caffeine_mg")

In [None]:
df2.plot.barh(x="Beverage", y="Caffeine_mg")

### A grouped bar plot

In [None]:
df1.Beverage_category.unique()

In [None]:
df3 = df.loc[(df.Beverage_category=='Classic Espresso Drinks') & (df.Beverage.isin(['Caffè Latte','Vanilla Latte (Or Other Flavoured Latte)']))]

In [None]:
# Select the numeric column before averaging; mean() over text columns raises an error in pandas 2.x
df3.groupby("Beverage")["Calories"].mean().plot.bar(color=["blue", "purple"])