## Exploratory Analysis

Exploratory analysis is one of the main tasks you perform with Pandas. Looking at the data and finding what is useful, or what is potentially wrong with it so that you can clean it up, are core practices for data scientists and data engineers.

## Create a Pandas DataFrame
Load a CSV file into a DataFrame to start exploring the data.

In [3]:
pip install pandas


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3.1[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [6]:
import pandas as pd
df = pd.read_csv("wine-ratings.csv", index_col=0)
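
To make the effect of `index_col=0` concrete, here is a minimal sketch that reads a small in-memory CSV twice. The two sample rows are hypothetical and only mimic the column layout of `wine-ratings.csv`; `io.StringIO` stands in for the file on disk.

```python
import io

import pandas as pd

# Hypothetical sample rows that mimic the layout of wine-ratings.csv
csv_text = (
    ",name,grape,region,variety,rating,notes\n"
    "0,Sample Wine A 2014,,California,Red Wine,90.0,Example notes\n"
    "1,Sample Wine B 2012,,Spain,Red Wine,94.0,More example notes\n"
)

# Without index_col, the unnamed first column is kept as a data column
raw = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index instead
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(sample.columns))
```

The second read leaves only the six data columns, which is the shape the rest of this notebook works with.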

In [7]:
# .head() is the most common way to preview the first rows of a DataFrame
df.head(15)

Unnamed: 0,name,grape,region,variety,rating,notes
0,1000 Stories Bourbon Barrel Aged Batch Blue Ca...,,"Mendocino, California",Red Wine,91.0,"This is a very special, limited release of 100..."
1,1000 Stories Bourbon Barrel Aged Gold Rush Red...,,California,Red Wine,89.0,The California Gold Rush was a period of coura...
2,1000 Stories Bourbon Barrel Aged Gold Rush Red...,,California,Red Wine,90.0,The California Gold Rush was a period of coura...
3,1000 Stories Bourbon Barrel Aged Zinfandel 2013,,"North Coast, California",Red Wine,91.0,"The wine has a deep, rich purple color. An int..."
4,1000 Stories Bourbon Barrel Aged Zinfandel 2014,,California,Red Wine,90.0,Batch #004 is the first release of the 2014 vi...
5,1000 Stories Bourbon Barrel Aged Zinfandel 2016,,California,Red Wine,91.0,"1,000 Stories Bourbon barrel-aged Zinfandel is..."
6,1000 Stories Bourbon Barrel Aged Zinfandel 2017,,California,Red Wine,92.0,"Batch 55 embodies an opulent vintage, which sa..."
7,12 Linajes Crianza 2014,,"Ribera del Duero, Spain",Red Wine,92.0,Red with violet hues. The aromas are very inte...
8,12 Linajes Reserva 2012,,"Ribera del Duero, Spain",Red Wine,94.0,"On the nose, a complex predominance of mineral..."
9,14 Hands Cabernet Sauvignon 2010,,"Columbia Valley, Washington",Red Wine,87.0,Concentrated aromas of dark stone fruits and t...


In [8]:

# Now let's get summary statistics for the numeric columns
df.describe()

Unnamed: 0,grape,rating
count,0.0,32780.0
mean,,91.186608
std,,2.190391
min,,85.0
25%,,90.0
50%,,91.0
75%,,92.0
max,,99.0
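
By default, `.describe()` summarizes only the numeric columns, which is why the text columns are missing from the table above. As a rough sketch on a small hypothetical sample (not the wine dataset itself), passing `include="all"` adds count, unique, top, and freq rows for the text columns as well:

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the call
sample = pd.DataFrame(
    {
        "name": ["Sample Wine A", "Sample Wine B", "Sample Wine C"],
        "region": ["California", "California", "Spain"],
        "rating": [90.0, 91.0, 94.0],
    }
)

# Default: numeric columns only
numeric_summary = sample.describe()

# include="all": numeric statistics plus unique/top/freq for text columns
full_summary = sample.describe(include="all")

print(numeric_summary)
print(full_summary)
```

Statistics that do not apply to a column, such as the mean of `region`, show up as `NaN` in the combined table.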


In [9]:
# You can also get metadata about the dataset with .info()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32780 entries, 0 to 32779
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   name     32780 non-null  object 
 1   grape    0 non-null      float64
 2   region   32777 non-null  object 
 3   variety  32422 non-null  object 
 4   rating   32780 non-null  float64
 5   notes    32780 non-null  object 
dtypes: float64(2), object(4)
memory usage: 1.8+ MB
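
The non-null counts from `.info()` hint at missing data, but it can be easier to read the gaps directly. The sketch below, on a small hypothetical sample, counts missing values per column with `.isna().sum()` and expresses them as a fraction with `.isna().mean()`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one empty column and one partly missing column
sample = pd.DataFrame(
    {
        "name": ["Sample Wine A", "Sample Wine B", "Sample Wine C"],
        "grape": [np.nan, np.nan, np.nan],
        "region": ["California", None, "Spain"],
        "rating": [90.0, 91.0, 94.0],
    }
)

# Number of missing values in each column
missing_counts = sample.isna().sum()

# Fraction of missing values in each column (1.0 means entirely empty)
missing_share = sample.isna().mean()

print(missing_counts)
print(missing_share)
```

A column with a share of 1.0 carries no information and is a natural candidate for removal.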


In [10]:
# Sort by rating, highest first, and show the top five rows
df.sort_values(by="rating", ascending=False).head()

Unnamed: 0,name,grape,region,variety,rating,notes
9986,Chateau Angelus (Futures Pre-Sale) 2019,,"St. Emilion, Bordeaux, France",Red Wine,99.0,"This 2019 vintage, made while the estate was u..."
21597,Espectacle Espectacle del Montsant 2012,,Spain,Red Wine,99.0,Its color is surprisingly intense compared to ...
12857,Chateau Pavie (1.5 Liter Futures Pre-Sale) 2019,,"St. Emilion, Bordeaux, France",Red Wine,99.0,"Blend: 50% Merlot, 32% Cabernet Franc, 18% Cab..."
25936,Guigal La Turque Cote Rotie 2010,,"Cote Rotie, Rhone, France",Red Wine,99.0,La Turque displays deep ruby red color with da...
12856,Chateau Pavie (1.5 Liter Futures Pre-Sale) 2018,,"St. Emilion, Bordeaux, France",Red Wine,99.0,"Blend: 60% Merlot, 22% Cabernet Franc and 18% ..."
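
All five rows above share the same rating, so the order among them is not determined by the sort key alone. One way to get a predictable order, sketched here on hypothetical data, is to pass a second column as a tie-breaker:

```python
import pandas as pd

# Hypothetical sample with a tie on rating
sample = pd.DataFrame(
    {
        "name": ["Sample Wine C", "Sample Wine A", "Sample Wine B"],
        "rating": [99.0, 99.0, 95.0],
    }
)

# Sort by rating (highest first), then by name (A to Z) to break ties
ranked = sample.sort_values(by=["rating", "name"], ascending=[False, True])

print(ranked)
```

The `ascending` list lines up with the `by` list, so each column can have its own direction.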


Remove any newlines or carriage returns from the text columns.

In [11]:
df = df.replace({"\r": ""}, regex=True)
df = df.replace({"\n": " "}, regex=True)
df.head(10)

Unnamed: 0,name,grape,region,variety,rating,notes
0,1000 Stories Bourbon Barrel Aged Batch Blue Ca...,,"Mendocino, California",Red Wine,91.0,"This is a very special, limited release of 100..."
1,1000 Stories Bourbon Barrel Aged Gold Rush Red...,,California,Red Wine,89.0,The California Gold Rush was a period of coura...
2,1000 Stories Bourbon Barrel Aged Gold Rush Red...,,California,Red Wine,90.0,The California Gold Rush was a period of coura...
3,1000 Stories Bourbon Barrel Aged Zinfandel 2013,,"North Coast, California",Red Wine,91.0,"The wine has a deep, rich purple color. An int..."
4,1000 Stories Bourbon Barrel Aged Zinfandel 2014,,California,Red Wine,90.0,Batch #004 is the first release of the 2014 vi...
5,1000 Stories Bourbon Barrel Aged Zinfandel 2016,,California,Red Wine,91.0,"1,000 Stories Bourbon barrel-aged Zinfandel is..."
6,1000 Stories Bourbon Barrel Aged Zinfandel 2017,,California,Red Wine,92.0,"Batch 55 embodies an opulent vintage, which sa..."
7,12 Linajes Crianza 2014,,"Ribera del Duero, Spain",Red Wine,92.0,Red with violet hues. The aromas are very inte...
8,12 Linajes Reserva 2012,,"Ribera del Duero, Spain",Red Wine,94.0,"On the nose, a complex predominance of mineral..."
9,14 Hands Cabernet Sauvignon 2010,,"Columbia Valley, Washington",Red Wine,87.0,Concentrated aromas of dark stone fruits and t...
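
The first ten rows look the same before and after the cleanup, so the effect is hard to see in the preview. This minimal sketch applies the same two `replace` calls to a hypothetical row that does contain a Windows-style line break:

```python
import pandas as pd

# Hypothetical sample: the first note contains a carriage return and a newline
sample = pd.DataFrame(
    {
        "notes": ["Deep color.\r\nLong finish.", "No line breaks here"],
        "rating": [90.0, 91.0],
    }
)

# Drop carriage returns, then turn newlines into single spaces
cleaned = sample.replace({"\r": ""}, regex=True)
cleaned = cleaned.replace({"\n": " "}, regex=True)

print(cleaned.loc[0, "notes"])
```

With `regex=True` the patterns match inside longer strings rather than requiring the whole cell to equal the pattern, and the numeric column is left untouched.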


In [12]:
# The grape column has no non-null values, so drop it and describe the data again
df.drop(['grape'], axis=1, inplace=True)
df.describe()

Unnamed: 0,rating
count,32780.0
mean,91.186608
std,2.190391
min,85.0
25%,90.0
50%,91.0
75%,92.0
max,99.0
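
Dropping a column by name works when you already know which one is empty. As an alternative sketch, assuming the goal is to remove every column that holds no values at all, `dropna(axis=1, how="all")` finds such columns for you. The sample data is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one entirely empty column
sample = pd.DataFrame(
    {
        "name": ["Sample Wine A", "Sample Wine B"],
        "grape": [np.nan, np.nan],
        "rating": [90.0, 91.0],
    }
)

# Drop only the columns in which every value is missing
trimmed = sample.dropna(axis=1, how="all")

print(list(trimmed.columns))
```

This returns a new DataFrame and leaves `sample` unchanged, which avoids the side effects of `inplace=True`.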


In [None]:
# Aggregate each group with a method such as .mean().
# numeric_only=True skips the text columns, which cannot be averaged.
df.groupby("region").mean(numeric_only=True)
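
An average per region is easier to judge when you can also see how many wines it is based on. The following sketch, on hypothetical data, selects the `rating` column and computes both the mean and the count for each region:

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the call
sample = pd.DataFrame(
    {
        "region": ["California", "California", "Spain"],
        "rating": [90.0, 92.0, 94.0],
    }
)

# Mean rating and number of wines per region, best-rated region first
by_region = (
    sample.groupby("region")["rating"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

print(by_region)
```

By default `groupby` leaves out rows whose grouping key is missing, so wines without a region do not appear in the result.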