# ETL_WineXL_CleanUP

### Comments on ETL_WineXL_CleanUp

### 1) Because the file we were cleaning up was an .xlsx workbook, we installed openpyxl and used it as the engine to read the Excel file

### 2) Eliminated Vintage, County and Designation columns

### 3) Determined that there were incomplete rows and then eliminated those rows

### 4) Checked whether each column had the appropriate data type. Noted that Price was stored as an object rather than a float

### 5) Converted the price column to a float

### 6) Eliminated all wines that were not included in the top 6 favorite wines per the Art Wine Preserver website

### 7) Reindexed the DataFrame after eliminating non-top 6 wines

In [1]:
import pandas as pd
from sqlalchemy import create_engine
#!pip install openpyxl

### Store xlsx into a DataFrame

In [2]:
data_file = "Resources/Wines.xlsx"
df = pd.read_excel(data_file, engine='openpyxl')
df.head()

Unnamed: 0,Vintage,Country,County,Designation,Points,Price,Province,Title,Variety,Winery
0,1919-01-01 00:00:00,Spain,Cava,1919 Brut Selecció,88,$13.00,Catalonia,L'Arboc NV 1919 Brut Selecció Sparkling (Cava),Sparkling Blend,L'Arboc
1,1929-01-01 00:00:00,Italy,Vernaccia di San Gimignano,,87,$14.00,Tuscany,Guidi 1929 2015 Vernaccia di San Gimignano,Vernaccia,Guidi 1929
2,1929-01-01 00:00:00,Italy,Sangiovese di Romagna Superiore,Prugneto,84,$15.00,Central Italy,Poderi dal Nespoli 1929 2011 Prugneto (Sangiov...,Sangiovese,Poderi dal Nespoli 1929
3,1934-01-01 00:00:00,Portugal,,Reserva Velho,93,$495.00,Colares,Adega Viuva Gomes 1934 Reserva Velho Red (Cola...,Ramisco,Adega Viuva Gomes
4,1945-01-01 00:00:00,France,Rivesaltes,Legend Vintage,95,$350.00,Languedoc-Roussillon,Gérard Bertrand 1945 Legend Vintage Red (Rives...,Red Blend,Gérard Bertrand


### Eliminate unnecessary columns

In [3]:
new_df = df[['Country', 'Points', "Price", 'Province', 'Title', 'Variety', 'Winery']].copy()
new_df.head()

Unnamed: 0,Country,Points,Price,Province,Title,Variety,Winery
0,Spain,88,$13.00,Catalonia,L'Arboc NV 1919 Brut Selecció Sparkling (Cava),Sparkling Blend,L'Arboc
1,Italy,87,$14.00,Tuscany,Guidi 1929 2015 Vernaccia di San Gimignano,Vernaccia,Guidi 1929
2,Italy,84,$15.00,Central Italy,Poderi dal Nespoli 1929 2011 Prugneto (Sangiov...,Sangiovese,Poderi dal Nespoli 1929
3,Portugal,93,$495.00,Colares,Adega Viuva Gomes 1934 Reserva Velho Red (Cola...,Ramisco,Adega Viuva Gomes
4,France,95,$350.00,Languedoc-Roussillon,Gérard Bertrand 1945 Legend Vintage Red (Rives...,Red Blend,Gérard Bertrand
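
An equivalent way to express this step is to drop the unwanted columns by name instead of listing the ones to keep. The following is a minimal sketch, assuming a small stand-in DataFrame: the `sample` frame is hypothetical and only mirrors some of the notebook's column names, not the real Wines.xlsx data.

```python
import pandas as pd

# Hypothetical stand-in for the loaded workbook (one row, subset of columns).
sample = pd.DataFrame({
    "Vintage": ["1919-01-01"],
    "Country": ["Spain"],
    "County": ["Cava"],
    "Designation": ["1919 Brut Selecció"],
    "Points": [88],
    "Price": ["$13.00"],
})

# Drop the three columns the cleanup does not need.
# drop() returns a new DataFrame and leaves `sample` unchanged.
trimmed = sample.drop(columns=["Vintage", "County", "Designation"])
print(list(trimmed.columns))
```

Selecting the columns to keep, as the notebook does, is safer when the source file may gain new columns later; dropping by name is shorter when only a few columns need to go.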


### Check to see if all rows have complete data

In [4]:
new_df.count()

Country     24989
Points      24997
Price       23375
Province    24989
Title       24997
Variety     24997
Winery      24997
dtype: int64
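
The counts above show that Country and Province each fall 8 short of the 24,997 total, and Price falls 1,622 short. A more direct way to see the gaps is to count missing values per column. This is a minimal sketch on a hypothetical three-row frame (`sample` is an assumption for illustration, not the notebook's data):

```python
import pandas as pd

# Hypothetical frame with one missing Country and one missing Price.
sample = pd.DataFrame({
    "Country": ["Spain", None, "Italy"],
    "Price": ["$13.00", "$14.00", None],
    "Points": [88, 87, 84],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Number of rows that have at least one missing value;
# these are the rows that dropna(how='any') would remove.
rows_with_gaps = int(sample.isna().any(axis=1).sum())

print(missing_per_column)
print(rows_with_gaps)
```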

### Eliminate rows with incomplete data

In [5]:
new_df = new_df.dropna(how='any')
new_df.count()

Country     23367
Points      23367
Price       23367
Province    23367
Title       23367
Variety     23367
Winery      23367
dtype: int64

### Determine if all columns are set up in the correct data type

In [6]:
new_df.dtypes

Country     object
Points       int64
Price       object
Province    object
Title       object
Variety     object
Winery      object
dtype: object

### Note: Price should be a float

### Convert the Price column to float

In [7]:

new_df['Price'] = [float(x.replace("$","").replace(",","")) for x in new_df['Price']]
new_df.dtypes

Country      object
Points        int64
Price       float64
Province     object
Title        object
Variety      object
Winery       object
dtype: object
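
The same conversion can also be written with pandas string methods instead of a list comprehension. The sketch below is one possible alternative, not the notebook's method; the `prices` Series is hypothetical sample data. Passing `errors="coerce"` to `pd.to_numeric` is an added assumption: it turns any value that still cannot be parsed into `NaN` rather than raising an error.

```python
import pandas as pd

# Hypothetical price strings in the same "$1,234.00" style as the dataset.
prices = pd.Series(["$13.00", "$1,495.00", "$350.00"])

# Strip the dollar sign and thousands separator as literal text (regex=False).
cleaned = (
    prices.str.replace("$", "", regex=False)
          .str.replace(",", "", regex=False)
)

# Convert to numbers; unparseable values become NaN instead of raising.
as_float = pd.to_numeric(cleaned, errors="coerce")
print(as_float.dtype)
```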

### Isolate the top 6 wine types per https://artwinepreserver.com/: Cabernet Sauvignon, Chardonnay, Pinot Gris/Pinot Grigio, Pinot Noir, Sauvignon Blanc, Merlot (Pinot Gris/Pinot Grigio appears as Pinot Gris in this dataset)

In [8]:
winexl_df = new_df.loc[(new_df['Variety'] == "Cabernet Sauvignon") | (new_df['Variety'] == "Chardonnay") 
                       | (new_df['Variety'] == "Pinot Gris") | (new_df['Variety'] == "Pinot Noir")
                       | (new_df['Variety'] == "Sauvignon Blanc") | (new_df['Variety'] == "Merlot")]
count = winexl_df['Variety'].value_counts()
count

Pinot Noir            2552
Chardonnay            2167
Cabernet Sauvignon    1834
Sauvignon Blanc        986
Merlot                 587
Pinot Gris             274
Name: Variety, dtype: int64
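
A chain of `==` comparisons joined with `|` can usually be replaced by `Series.isin`, which keeps the list of varieties in one place. This is a minimal sketch on a hypothetical four-row frame; `TOP_VARIETIES` and `sample` are names introduced here for illustration.

```python
import pandas as pd

# The six varieties kept by the notebook's filter.
TOP_VARIETIES = [
    "Cabernet Sauvignon", "Chardonnay", "Pinot Gris",
    "Pinot Noir", "Sauvignon Blanc", "Merlot",
]

# Hypothetical sample rows; two of the four varieties are in the list.
sample = pd.DataFrame({
    "Variety": ["Merlot", "Ramisco", "Pinot Gris", "Red Blend"],
    "Points": [88, 93, 87, 95],
})

# Keep only rows whose Variety is in the list, then renumber the index from 0.
filtered = sample[sample["Variety"].isin(TOP_VARIETIES)].reset_index(drop=True)
print(filtered)
```

Without `reset_index(drop=True)`, the filtered frame would keep the original row labels (0 and 2 here), which is the reason the notebook reindexes after filtering.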

In [9]:
winexl_df.head()

Unnamed: 0,Country,Points,Price,Province,Title,Variety,Winery
11,US,89,170.0,California,Sebastiani 1987 Cherryblock Cabernet Sauvignon...,Cabernet Sauvignon,Sebastiani
23,US,82,13.0,California,Gan Eden 1994 Chardonnay (Sonoma County),Chardonnay,Gan Eden
33,South Africa,87,17.0,Stellenbosch,Middelvlei 1995 Cabernet Sauvignon (Stellenbosch),Cabernet Sauvignon,Middelvlei
36,US,83,22.0,California,Meridian 1996 Coastal Reserve Cabernet Sauvign...,Cabernet Sauvignon,Meridian
37,US,84,29.0,Washington,Covey Run 1996 Whiskey Canyon Vyd Cabernet Sau...,Cabernet Sauvignon,Covey Run


In [10]:

winexl_df = winexl_df.reset_index(drop=True)
winexl_df.head()

Unnamed: 0,Country,Points,Price,Province,Title,Variety,Winery
0,US,89,170.0,California,Sebastiani 1987 Cherryblock Cabernet Sauvignon...,Cabernet Sauvignon,Sebastiani
1,US,82,13.0,California,Gan Eden 1994 Chardonnay (Sonoma County),Chardonnay,Gan Eden
2,South Africa,87,17.0,Stellenbosch,Middelvlei 1995 Cabernet Sauvignon (Stellenbosch),Cabernet Sauvignon,Middelvlei
3,US,83,22.0,California,Meridian 1996 Coastal Reserve Cabernet Sauvign...,Cabernet Sauvignon,Meridian
4,US,84,29.0,Washington,Covey Run 1996 Whiskey Canyon Vyd Cabernet Sau...,Cabernet Sauvignon,Covey Run


In [12]:
rds_connection_string = "<postgres>:<postgres>@localhost:5432/customer_db"
engine = create_engine(f'postgresql://{rds_connection_string}')

ModuleNotFoundError: No module named 'psycopg2'
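
The `ModuleNotFoundError` means the `psycopg2` driver, which SQLAlchemy uses by default for `postgresql://` URLs, is not installed in the environment. Assuming a pip-based setup, installing the `psycopg2-binary` package is one common way to provide it. The sketch below checks for the driver before the engine is created; `driver_installed` is a hypothetical helper, not part of SQLAlchemy.

```python
import importlib.util


def driver_installed(module_name: str = "psycopg2") -> bool:
    """Return True if the named top-level module can be imported here."""
    return importlib.util.find_spec(module_name) is not None


if not driver_installed():
    # Assumption: a pip-based environment; other package managers differ.
    print("psycopg2 is missing; try: pip install psycopg2-binary")
```

Running this check in a cell before `create_engine` gives a clearer message than the traceback.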