# Capstone Project Part 2a: Cleaning "Data_DF"

**Authur:** Kate Meredith  

**Date:** September-November 2022

**Notebook #:** 2 of

## Background

**Source:** Data was collected from [CoffeeReview.com](https://www.coffeereview.com/) and grouped into two DataFrames for cleaning. See "Capstone Project Part 1" for more information on scraping. See notebook on cleaning "older_data_df" for cleaning of second portion of data.

**Initial Data Overview:** The following outlines the column headers for the data initially scraped and corresponding clean up plan:

Independent Variables:
- `coffee_name` 
    - def: name of coffee reviewed
    - options: use text vectoring to tokenize or drop; one-hot encoding likely impractical given number of variations in names
- `roaster_name` 
    - def: name of roaster that roasted the coffee
    - options: one-hot encoding or drop
- `roaster_location` 
    - def: location of coffee roaster
    - turn into latitude and longitude
- `coffee_origin` 
    - def: location of coffee bean origination
    - turn into latitude and longitude
- `roast_level` 
    - def: describes how long and thoroughly the beans are cooked, ranging from light to dark
    - options: encode numerically (ordinal values) or drop; agtron scores measure roast level with number so may drop this
- `agtron` 
    - def: numerical measure of how roasted the beans are; first number is taken from measuring the whole beans, second from the grounds
    - options: split into two columns, one for `bean_agtron`, one for `ground_agtron`
- `est_price`
    - def: est cost of the coffee
    - options: clean up and turn into measure that can be compared (such as USD cost per ounce) or drop; dropping may be necessary for sake of time given how complex clean up is (different measurement amounts and inflation affecting cost (data spans 1997 to 2022)
- `h1`
    - def: placedholder for first header
    - these headers will become column headers in the DataFrame; webpage format was inconsistent, so pulled headers in one column, followed by score in next so that scores can be correctly attributed to the right header
- `s1`
    - def: first score listed on webpage
    - webpage format was inconsistent; need to verify which header the score corresponds to and move to the appropriate new column
- `h2`
    - def: placedholder for second header
    - see header above
- `s2`
    - def: second score listed on webpage
    - see above
- `h3`
    - def: placedholder for third header
    - see header above
- `s3`
    - def: third score listed on webpage
    - see above
- `h4`
    - def: placedholder for fourth header
    - see header above
- `s4`
    - def: fourth score listed on webpage
    - see above
- `h5`
    - def: placedholder for fifth header
    - see header above
- `s5`
    - def: fifth score listed on webpage
    - see above
- `p1`
    - def: first paragraph scraped from webpage
    - verify if paragraphs all report same type of information; then use text vecotrizing to tokenize
- `p2`
    - def: second paragraph scraped from webpage
    - verify if paragraphs all report same type of information; then use text vecotrizing to tokenize
- `p3`
    - def: third paragraph scraped from webpage
    - verify if paragraphs all report same type of information; then use text vecotrizing to tokenize; expect there to be some difference in type of content in paragraph 3 as format changed over the years

Target Variable:
- `overall_score`
    - def: overall score awarded to coffee
    - verify data type is numeric, should not need altering otherwise

## References

- Referenced this [article](https://www.geeksforgeeks.org/split-a-text-column-into-two-columns-in-pandas-dataframe/) for splitting columns and keeping in dataframe 
- For [splitting columns with nans](https://stackoverflow.com/questions/69354795/how-to-skip-nan-values-when-splitting-up-a-column)
- References for [cleaning up measurement columns](https://www.geeksforgeeks.org/how-to-replace-values-in-column-based-on-condition-in-pandas/)
- Used [Regex 101](https://regex101.com/) to verify regex replacements
- Used these articles to address turning datatypes to integers with nan present:
    - [Changing to nans](https://stackoverflow.com/questions/34794067/how-to-set-a-cell-to-nan-in-a-pandas-dataframe)
    - [Using numpy to allow nans in integers](https://stackoverflow.com/questions/21287624/convert-pandas-column-containing-nans-to-dtype-int)
- Used this stack overflow [article](https://stackoverflow.com/questions/25888396/how-to-get-latitude-longitude-with-python) to acquire lat and long for locations

In [1]:
#Importing libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#import first dataset "data_df"
data_df = pd.read_csv('data_df.csv')

## EDA

Before beginning cleaning, exploring the data to check out scope of cleaning needs.

In [3]:
#checking data shape
data_df.shape

(4779, 22)

`data_df` has 4,779 rows and 22 columns.

In [4]:
#checking datatypes, null values
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4779 entries, 0 to 4778
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   coffee_name       4779 non-null   object
 1   roaster_name      4779 non-null   object
 2   roaster_location  4779 non-null   object
 3   coffee_origin     4779 non-null   object
 4   roast_level       4705 non-null   object
 5   agtron            4779 non-null   object
 6   est_price         4731 non-null   object
 7   review_date       4779 non-null   object
 8   overall_score     4779 non-null   int64 
 9   h1                4779 non-null   object
 10  s1                4779 non-null   int64 
 11  h2                4779 non-null   object
 12  s2                4779 non-null   int64 
 13  h3                4779 non-null   object
 14  s3                4779 non-null   int64 
 15  h4                4779 non-null   object
 16  s4                4779 non-null   int64 
 17  h5            

Much of the data imported as object, will need to determine plan for making numeric.

In [5]:
#counting null values
#there are a few, but not too many
data_df.isna().sum()

coffee_name          0
roaster_name         0
roaster_location     0
coffee_origin        0
roast_level         74
agtron               0
est_price           48
review_date          0
overall_score        0
h1                   0
s1                   0
h2                   0
s2                   0
h3                   0
s3                   0
h4                   0
s4                   0
h5                   0
s5                   0
p1                   0
p2                   0
p3                   6
dtype: int64

In [6]:
#previewing data
data_df.head()

Unnamed: 0,coffee_name,roaster_name,roaster_location,coffee_origin,roast_level,agtron,est_price,review_date,overall_score,h1,...,s2,h3,s3,h4,s4,h5,s5,p1,p2,p3
0,Colombia Cerro Azul Enano,Equator Coffees,"San Rafael, California","Trujillo, Valle del Cauca Department, Colombia",Medium-Light,60/77,$26.00/12 ounces,October 2022,94,Aroma:,...,9,Body:,9,Flavor:,9,Aftertaste:,8,Blind Assessment: Elegantly fruit- and cocoa-t...,Notes: Produced at Finca Cerro Azul (also owne...,The Bottom Line: This rare Enano (dwarf Geish...
1,Peru Incahuasi,Press Coffee,"Phoenix, Arizona","Cusco Region, Peru",Medium-Light,58/78,$26.00/12 ounces,October 2022,94,Aroma:,...,9,Body:,9,Flavor:,9,Aftertaste:,8,"Blind Assessment: Gently fruit-toned, integrat...",Notes: Produced at Incahuasi Farm from trees o...,The Bottom Line: Laden with tropical fruit not...
2,Colombia Aponte’s Guardians,Press Coffee,"Phoenix, Arizona","Nariño Department, Colombia",Medium-Light,59/77,$19.00/12 ounces,October 2022,93,Aroma:,...,9,Body:,8,Flavor:,9,Aftertaste:,8,"Blind Assessment: Richly sweet, spice-toned. L...",Notes: Produced at Aponte Farm from an undiscl...,"The Bottom Line: A balanced, inviting washed C..."
3,Nicaragua Flor de Dalia Natural,Equator Coffees,"San Rafael, California","La Dalia, Matagalpa Department, northern Nicar...",Medium-Light,62/78,$18.00/12 ounces,October 2022,92,Aroma:,...,8,Body:,9,Flavor:,9,Aftertaste:,8,"Blind Assessment: Gently fruit-forward, sweetl...",Notes: Produced by smallholding members of the...,"The Bottom Line: A refreshing, very sweet natu..."
4,Ethiopia Bench Maji Geisha G1 Natural,Taster's Coffee,"New Taipei City, Taiwan","Bench-Maji Zone, southern Ethiopia",Light,65/81,NT $520/227 grams,October 2022,93,Aroma:,...,9,Body:,8,Flavor:,9,Aftertaste:,8,"Blind Assessment: Gently sweet-tart, floral-to...",Notes: Produced from trees of the admired bota...,The Bottom Line: A quietly confident natural-p...


### Checking Values in Each Column

Checking out the types of values appearing in each column to verify if they make sense, any issues.

In [7]:
#coffee name value counts
data_df['coffee_name'].value_counts()

Holiday Blend                                      21
Espresso Blend                                     20
Ethiopia Yirgacheffe                               11
Sumatra Tano Batak                                  9
Ethiopia Hambela Natural                            8
                                                   ..
Kenya Handege                                       1
Kauai Estate Reserve Sun-Dried Typica               1
Colombia Valle del Cauca Cerro Azul Geisha          1
Ecuador Loja Clara Sidra Natural Taza Dorada #2     1
Rwanda Coopac Cooperative                           1
Name: coffee_name, Length: 4231, dtype: int64

In [8]:
#roaster name value counts
data_df['roaster_name'].value_counts()

JBC Coffee Roasters          291
Paradise Roasters            230
Kakalove Cafe                170
Temple Coffee and Tea        103
Bird Rock Coffee Roasters     92
                            ... 
Boil Line Coffee Company       1
Red Giant Coffee Roasters      1
Pigeonhole Coffee              1
Red E Café                     1
Moonstruck Farm                1
Name: roaster_name, Length: 870, dtype: int64

In [9]:
#roast location value counts
data_df['roaster_location'].value_counts()

Madison, Wisconsin        297
Sacramento, California    201
Chia-Yi, Taiwan           175
San Diego, California     166
Minneapolis, Minnesota    157
                         ... 
San Juan, Puerto Rico       1
Tempe Arizona               1
Lexington, Kentucky         1
St. Louis, MIssouri         1
Pa'auilo, Hawaii            1
Name: roaster_location, Length: 499, dtype: int64

In [10]:
#coffee origin value counts
data_df['coffee_origin'].value_counts()

Not disclosed.                                                           156
Yirgacheffe growing region, southern Ethiopia                            147
Yirgacheffe growing region, southern Ethiopia.                           109
Boquete growing region, western Panama                                   100
Nyeri growing region, south-central Kenya                                 99
                                                                        ... 
La Libertad, Huehuetenango, Guatemala                                      1
Cusco Department, La Convención Province, Santa Teresa District, Peru      1
San Cristobal, Cobán, Alta Verapaz, Guatemala                              1
Jamaica                                                                    1
Nilgiris, southern India.                                                  1
Name: coffee_origin, Length: 1794, dtype: int64

In [11]:
#roast level value counts
data_df['roast_level'].value_counts()

Medium-Light    2592
Medium          1067
Light            584
Medium-Dark      299
Very Dark        118
Dark              45
Name: roast_level, dtype: int64

In [12]:
#agtron value counts
data_df['agtron'].value_counts()

58/76     133
58/78     114
60/78     108
62/80      84
56/78      81
         ... 
64/94       1
72/102      1
39/47       1
63/88       1
49/78       1
Name: agtron, Length: 764, dtype: int64

In [13]:
#est_price value counts
data_df['est_price'].value_counts()

$18.00/12 ounces                          154
$16.00/12 ounces                           95
$20.00/12 ounces                           91
$19.00/12 ounces                           87
$15.00/12 ounces                           84
                                         ... 
$9.99/7 ounces (198 grams)                  1
$19.00/six 5-gram packets                   1
$15.00/six 5-gram single-serve packets      1
$22.00/eight 5-gram tubes                   1
$14.25 / 12 oz.                             1
Name: est_price, Length: 1537, dtype: int64

In [14]:
#review_date value counts
data_df['review_date'].value_counts()

November 2021    66
March 2021       63
August 2021      63
October 2020     61
August 2020      61
                 ..
January 2017      8
December 2009     3
October 2009      1
August 2009       1
June 2009         1
Name: review_date, Length: 156, dtype: int64

In [15]:
#overall_score value counts
data_df['overall_score'].value_counts()

93    1191
92     914
94     873
91     470
95     414
90     388
89     129
96     115
88     104
87      55
97      28
86      24
85      17
84      12
83       7
79       6
77       4
75       3
73       3
80       3
98       3
67       3
68       2
72       2
78       2
63       1
82       1
71       1
74       1
76       1
66       1
69       1
Name: overall_score, dtype: int64

In [16]:
#h1 value counts
data_df['h1'].value_counts()

Aroma:                4743
Acidity/Structure:      24
Acidity:                12
Name: h1, dtype: int64

In [17]:
#s1 value counts
data_df['s1'].value_counts()

9     3207
8     1322
7      121
10      95
6       15
5        9
4        5
2        3
3        2
Name: s1, dtype: int64

In [18]:
#h2 value counts
data_df['h2'].value_counts()

Acidity:              2078
Acidity/Structure:    1935
Body:                  766
Name: h2, dtype: int64

In [19]:
#s2 value counts
data_df['s2'].value_counts()

8     2437
9     1952
7      327
6       28
10      15
4        6
3        5
5        5
1        2
2        2
Name: s2, dtype: int64

In [20]:
#h3 value counts
data_df['h3'].value_counts()

Body:      4013
Flavor:     766
Name: h3, dtype: int64

In [21]:
#s3 value counts
data_df['s3'].value_counts()

9     2436
8     2165
7      143
6       18
10      15
5        2
Name: s3, dtype: int64

In [22]:
#h4 value counts
data_df['h4'].value_counts()

Flavor:        4013
Aftertaste:     766
Name: h4, dtype: int64

In [23]:
#s4 value counts
data_df['s4'].value_counts()

9     3278
8     1085
10     213
7      164
6       11
5        9
3        8
4        8
2        2
1        1
Name: s4, dtype: int64

In [24]:
#h5 value counts
data_df['h5'].value_counts()

Aftertaste:    4013
With Milk:      766
Name: h5, dtype: int64

In [25]:
#s5 value counts
data_df['s5'].value_counts()

8     2975
9     1220
7      494
10      31
6       29
5       12
4       10
3        6
2        2
Name: s5, dtype: int64

## Cleaning Data

Cleaning steps for this notebook:
- Separate `agtron` score in two columns. The first number refers to `bean_agtron` and the second to`ground_agtron`; these scores measure the roast level in these two forms.
- Update non NLP features to numerical values datatypes, particularly integer where appropriate.
- To get all values, scraped the value header and value. Some of these are in the incorrect columns. Sort these out so that the headers currently listed as values become column headers, and corresponding scores are in the correct column.
- Replace location data with lat and lon.
- Separate review date into two numerical columns for month and year.

Note: In hindsight, it would have been more efficient to sort out the jumbled column values for both datasets first, and then clean aspects like data type. However, much of the cleaning was done on each dataframe instead before they were combined.

### Splitting agtron into two columns, one for whole bean score and one for ground score

- The first number refers to the bean agtron score
- The second number refers to the ground agtron score

In [26]:
#split agtron into 2 columns
data_df[['bean_agtron','ground_agtron']] = data_df['agtron'].str.split("/",expand=True)

In [27]:
#verify split
data_df.head()

Unnamed: 0,coffee_name,roaster_name,roaster_location,coffee_origin,roast_level,agtron,est_price,review_date,overall_score,h1,...,s3,h4,s4,h5,s5,p1,p2,p3,bean_agtron,ground_agtron
0,Colombia Cerro Azul Enano,Equator Coffees,"San Rafael, California","Trujillo, Valle del Cauca Department, Colombia",Medium-Light,60/77,$26.00/12 ounces,October 2022,94,Aroma:,...,9,Flavor:,9,Aftertaste:,8,Blind Assessment: Elegantly fruit- and cocoa-t...,Notes: Produced at Finca Cerro Azul (also owne...,The Bottom Line: This rare Enano (dwarf Geish...,60,77
1,Peru Incahuasi,Press Coffee,"Phoenix, Arizona","Cusco Region, Peru",Medium-Light,58/78,$26.00/12 ounces,October 2022,94,Aroma:,...,9,Flavor:,9,Aftertaste:,8,"Blind Assessment: Gently fruit-toned, integrat...",Notes: Produced at Incahuasi Farm from trees o...,The Bottom Line: Laden with tropical fruit not...,58,78
2,Colombia Aponte’s Guardians,Press Coffee,"Phoenix, Arizona","Nariño Department, Colombia",Medium-Light,59/77,$19.00/12 ounces,October 2022,93,Aroma:,...,8,Flavor:,9,Aftertaste:,8,"Blind Assessment: Richly sweet, spice-toned. L...",Notes: Produced at Aponte Farm from an undiscl...,"The Bottom Line: A balanced, inviting washed C...",59,77
3,Nicaragua Flor de Dalia Natural,Equator Coffees,"San Rafael, California","La Dalia, Matagalpa Department, northern Nicar...",Medium-Light,62/78,$18.00/12 ounces,October 2022,92,Aroma:,...,9,Flavor:,9,Aftertaste:,8,"Blind Assessment: Gently fruit-forward, sweetl...",Notes: Produced by smallholding members of the...,"The Bottom Line: A refreshing, very sweet natu...",62,78
4,Ethiopia Bench Maji Geisha G1 Natural,Taster's Coffee,"New Taipei City, Taiwan","Bench-Maji Zone, southern Ethiopia",Light,65/81,NT $520/227 grams,October 2022,93,Aroma:,...,8,Flavor:,9,Aftertaste:,8,"Blind Assessment: Gently sweet-tart, floral-to...",Notes: Produced from trees of the admired bota...,The Bottom Line: A quietly confident natural-p...,65,81


In [28]:
#drop unneeded agtron column
data_df.drop('agtron', axis=1, inplace=True)

In [29]:
#checking drop
data_df.head()

Unnamed: 0,coffee_name,roaster_name,roaster_location,coffee_origin,roast_level,est_price,review_date,overall_score,h1,s1,...,s3,h4,s4,h5,s5,p1,p2,p3,bean_agtron,ground_agtron
0,Colombia Cerro Azul Enano,Equator Coffees,"San Rafael, California","Trujillo, Valle del Cauca Department, Colombia",Medium-Light,$26.00/12 ounces,October 2022,94,Aroma:,9,...,9,Flavor:,9,Aftertaste:,8,Blind Assessment: Elegantly fruit- and cocoa-t...,Notes: Produced at Finca Cerro Azul (also owne...,The Bottom Line: This rare Enano (dwarf Geish...,60,77
1,Peru Incahuasi,Press Coffee,"Phoenix, Arizona","Cusco Region, Peru",Medium-Light,$26.00/12 ounces,October 2022,94,Aroma:,9,...,9,Flavor:,9,Aftertaste:,8,"Blind Assessment: Gently fruit-toned, integrat...",Notes: Produced at Incahuasi Farm from trees o...,The Bottom Line: Laden with tropical fruit not...,58,78
2,Colombia Aponte’s Guardians,Press Coffee,"Phoenix, Arizona","Nariño Department, Colombia",Medium-Light,$19.00/12 ounces,October 2022,93,Aroma:,9,...,8,Flavor:,9,Aftertaste:,8,"Blind Assessment: Richly sweet, spice-toned. L...",Notes: Produced at Aponte Farm from an undiscl...,"The Bottom Line: A balanced, inviting washed C...",59,77
3,Nicaragua Flor de Dalia Natural,Equator Coffees,"San Rafael, California","La Dalia, Matagalpa Department, northern Nicar...",Medium-Light,$18.00/12 ounces,October 2022,92,Aroma:,8,...,9,Flavor:,9,Aftertaste:,8,"Blind Assessment: Gently fruit-forward, sweetl...",Notes: Produced by smallholding members of the...,"The Bottom Line: A refreshing, very sweet natu...",62,78
4,Ethiopia Bench Maji Geisha G1 Natural,Taster's Coffee,"New Taipei City, Taiwan","Bench-Maji Zone, southern Ethiopia",Light,NT $520/227 grams,October 2022,93,Aroma:,9,...,8,Flavor:,9,Aftertaste:,8,"Blind Assessment: Gently sweet-tart, floral-to...",Notes: Produced from trees of the admired bota...,The Bottom Line: A quietly confident natural-p...,65,81


In [30]:
#checking data types for agtron scores
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4779 entries, 0 to 4778
Data columns (total 23 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   coffee_name       4779 non-null   object
 1   roaster_name      4779 non-null   object
 2   roaster_location  4779 non-null   object
 3   coffee_origin     4779 non-null   object
 4   roast_level       4705 non-null   object
 5   est_price         4731 non-null   object
 6   review_date       4779 non-null   object
 7   overall_score     4779 non-null   int64 
 8   h1                4779 non-null   object
 9   s1                4779 non-null   int64 
 10  h2                4779 non-null   object
 11  s2                4779 non-null   int64 
 12  h3                4779 non-null   object
 13  s3                4779 non-null   int64 
 14  h4                4779 non-null   object
 15  s4                4779 non-null   int64 
 16  h5                4779 non-null   object
 17  s5            

### Turning new agtron columns into numerical values

In [31]:
#checking values to see which need updated to allow switch to integer
data_df['bean_agtron'].unique()

array(['60', '58', '59', '62', '65', '46', '57', '56', '48', '64', '52',
       '61', '54', '45', '76', '68', '51', '55', '63', '44', '47', '50',
       '0', '75', '40', '38', '32', '34', '53', '49', '42', '66', '36',
       '35', '39', '82', '74', '85', '79', '5252', '547', '555', '67',
       '43', '41', '31', '69', 'NA', '37', '', '71', '70', '86', '78',
       '72', '81', '73', '33', '21', '20', '25', '29', '27', '22'],
      dtype=object)

In [32]:
#turning NA into nans
data_df['bean_agtron'].replace('NA', np.nan, inplace=True)

In [33]:
#turning blank into nans
data_df['bean_agtron'].replace('', np.nan, inplace=True)

In [34]:
#converting to integer
data_df['bean_agtron'] = data_df['bean_agtron'].astype('Int8')

In [35]:
#checking values to see which need updated to allow switch to integer
data_df['ground_agtron'].unique()

array(['77', '78', '81', '74', '82', '72', '67', '84', '86', '70', '80',
       '79', '76', '69', '66', '85', '92', '71', '88', '87', '75', '60',
       '61', '73', '0', '64', '62', '93', '58', '54', '50', '52', '56',
       '53', '44', '68', '65', '57', '83', '48', '63', '59', '104', '94',
       '105', '90', '96', '55', '46', '49', 'NA', '45', '', '91', '37',
       '99', '97', '89', '38', '51', '47', '40', '95', '98', '29', '34',
       '41', '43', '102', '39', '101', '42', '33', '36', '28', '30', '31',
       '32', '26', '18'], dtype=object)

In [36]:
#turning NA into nans
data_df['ground_agtron'].replace('NA', np.nan, inplace=True)

In [37]:
#turning blank into nans
data_df['ground_agtron'].replace('', np.nan, inplace=True)

In [38]:
#converting to integer
data_df['ground_agtron'] = data_df['ground_agtron'].astype('Int8')

In [39]:
#verifying data type updates
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4779 entries, 0 to 4778
Data columns (total 23 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   coffee_name       4779 non-null   object
 1   roaster_name      4779 non-null   object
 2   roaster_location  4779 non-null   object
 3   coffee_origin     4779 non-null   object
 4   roast_level       4705 non-null   object
 5   est_price         4731 non-null   object
 6   review_date       4779 non-null   object
 7   overall_score     4779 non-null   int64 
 8   h1                4779 non-null   object
 9   s1                4779 non-null   int64 
 10  h2                4779 non-null   object
 11  s2                4779 non-null   int64 
 12  h3                4779 non-null   object
 13  s3                4779 non-null   int64 
 14  h4                4779 non-null   object
 15  s4                4779 non-null   int64 
 16  h5                4779 non-null   object
 17  s5            

### Turn Roast_Level to Ordinal Values

Turning roast level to integers, ranging from 1-6, from Light to Very Dark.

In [40]:
#checking value options
data_df['roast_level'].unique()

array(['Medium-Light', 'Light', 'Medium', nan, 'Medium-Dark', 'Dark',
       'Very Dark'], dtype=object)

In [41]:
#mapping values to numbers
roast_mapper = {'Light':1, 'Medium-Light':2, 'Medium':3, 'Medium-Dark':4, 'Dark':5, 'Very Dark':6}

In [42]:
#updating values
data_df['roast_level'] = data_df['roast_level'].replace(roast_mapper)

In [43]:
#changing to integers
data_df['roast_level'] = data_df['roast_level'].astype('Int8')

In [44]:
#checking update
data_df.sample(10)

Unnamed: 0,coffee_name,roaster_name,roaster_location,coffee_origin,roast_level,est_price,review_date,overall_score,h1,s1,...,s3,h4,s4,h5,s5,p1,p2,p3,bean_agtron,ground_agtron
1054,Tigesit Waqa,Bird Rock Coffee Roasters,"San Diego, California","Yirgacheffe growing region, south-Central Ethi...",2,$20.00/12 ounces,November 2020,93,Aroma:,9,...,9,Flavor:,9,Aftertaste:,8,"Blind Assessment: Richly fruit-forward, chocol...",Notes: Southern Ethiopia coffees like this one...,"The Bottom Line: Concentrated, berry-driven na...",56,74
4306,Kenya Gichatha-Ini,Bird Rock Coffee Roasters,"La Jolla, California","Central Province, Kenya.",3,$22.00/12 ounces,July 2011,92,Aroma:,9,...,9,Flavor:,8,Aftertaste:,8,"Blind Assessment: Deep, rich, balanced. Pungen...",Notes: Produced mainly from trees of the admir...,Who Should Drink It: The pungent Kenya fruit i...,47,64
4109,House Blend (K-Cup),Tully's Coffee,"Seattle, Washington",Not disclosed.,6,$16.49/24 K-Cups,March 2012,87,Aroma:,7,...,8,Flavor:,8,Aftertaste:,7,Blind Assessment: (As brewed in a Keurig Plati...,Notes: Certified organically grown and Fair Tr...,Who Should Drink It: A very nicely balanced bl...,0,36
3738,Geisha P.E.B. (RFA),The Coffee Academics,"Hong Kong, China","Boquete growing region, western Panama",2,HK $588/100 grams,April 2013,94,Aroma:,9,...,8,Flavor:,9,Aftertaste:,9,Blind Assessment: Intensely and extravagantly ...,Notes: Produced from trees of the rare Ethiopi...,Who Should Drink It: An opportunity to sample ...,56,83
3447,Peaberry Reserve,Flight Coffee Co.,"Bedford, New Hampshire",Brazil; Kenya; Tanzania.,3,$14.50/12 ounces,February 2014,92,Aroma:,9,...,8,Flavor:,9,Aftertaste:,8,"Blind Assessment: Quietly balanced, complex. B...","Notes: This blend combines, with originality a...",Who Should Drink It: Those who enjoy white win...,50,70
2098,Bufcafe Rwanda Espresso,JBC Coffee Roasters,"Madison, Wisconsin","Nyamagabe district, Southern Province, Rwanda",2,$17.25/12 ounces,January 2018,93,Aroma:,9,...,9,Aftertaste:,7,With Milk:,9,Blind Assessment: Evaluated as espresso. Delic...,Notes: Produced from trees of the heirloom Bou...,The Bottom Line: An unapologetically savory-sw...,54,78
2595,SFCC House Blend Espresso Capsule,Gourmesso,"Berlin, Germany",Central America,6,$5.49/10 capsules,July 2016,87,Aroma:,7,...,8,Aftertaste:,7,With Milk:,7,Blind Assessment: Evaluated as espresso produc...,"Notes: This coffee is Fair Trade certified, me...","The Bottom Line: A crisp, gentle espresso with...",0,58
581,Tanzania Ngila Estate,Big Shoulders Coffee,"Chicago, Illinois","Ngila Estate, Ngorongoro Conservation Area, Ta...",2,$25.00/12 ounces,September 2021,93,Aroma:,9,...,9,Flavor:,9,Aftertaste:,8,Blind Assessment: Vibrantly sweet-savory. Red ...,Notes: Produced at Ngila Estate from trees of ...,"The Bottom Line: A rich-toned, complex, nuance...",60,78
2932,Kenya AA Kigwandi Estate,Willoughby's Coffee & Tea,"Branford, Connecticut","Nyeri County, Central Highlands, Kenya.",2,$17.99/16 ounces,September 2015,95,Aroma:,9,...,9,Flavor:,9,Aftertaste:,9,Blind Assessment: Sweetly and deeply pungent. ...,Notes: This exceptional coffee was selected as...,Who Should Drink It: A splendid and very chara...,59,75
3849,El Salvador Finca Malacara B,Fratello Coffee Roasters,"Calgary, Alberta, Canada","Volcan de Santa Ana regioin, El Salvador.",2,CAD $15.60/12 ounces,January 2013,90,Aroma:,7,...,8,Flavor:,9,Aftertaste:,8,"Blind Assessment: Soft, round, deep. Caramel, ...",Notes: Produced entirely from trees of the hei...,Who Should Drink It: Those who prefer quiet de...,61,75


### Addressing the Estimated Price Column

Background: Estimated price data uses a number of different measures (ounces, grams, bottles, etc). Original plan was to break it apart with goal of converting to same measure so that they can be compared in a meaningful way. This would involve modifying data to get a price per ounce. In addition, the reviews span from 1997 to 2022 and it is unclear if the pricing data gets updated.

Given how complex it is to turn this column into usable data, the column is being dropped for now. 

In [45]:
#drop est_price column
data_df.drop('est_price', axis=1, inplace=True)

In [46]:
data_df.head()

Unnamed: 0,coffee_name,roaster_name,roaster_location,coffee_origin,roast_level,review_date,overall_score,h1,s1,h2,...,s3,h4,s4,h5,s5,p1,p2,p3,bean_agtron,ground_agtron
0,Colombia Cerro Azul Enano,Equator Coffees,"San Rafael, California","Trujillo, Valle del Cauca Department, Colombia",2,October 2022,94,Aroma:,9,Acidity/Structure:,...,9,Flavor:,9,Aftertaste:,8,Blind Assessment: Elegantly fruit- and cocoa-t...,Notes: Produced at Finca Cerro Azul (also owne...,The Bottom Line: This rare Enano (dwarf Geish...,60,77
1,Peru Incahuasi,Press Coffee,"Phoenix, Arizona","Cusco Region, Peru",2,October 2022,94,Aroma:,9,Acidity/Structure:,...,9,Flavor:,9,Aftertaste:,8,"Blind Assessment: Gently fruit-toned, integrat...",Notes: Produced at Incahuasi Farm from trees o...,The Bottom Line: Laden with tropical fruit not...,58,78
2,Colombia Aponte’s Guardians,Press Coffee,"Phoenix, Arizona","Nariño Department, Colombia",2,October 2022,93,Aroma:,9,Acidity/Structure:,...,8,Flavor:,9,Aftertaste:,8,"Blind Assessment: Richly sweet, spice-toned. L...",Notes: Produced at Aponte Farm from an undiscl...,"The Bottom Line: A balanced, inviting washed C...",59,77
3,Nicaragua Flor de Dalia Natural,Equator Coffees,"San Rafael, California","La Dalia, Matagalpa Department, northern Nicar...",2,October 2022,92,Aroma:,8,Acidity/Structure:,...,9,Flavor:,9,Aftertaste:,8,"Blind Assessment: Gently fruit-forward, sweetl...",Notes: Produced by smallholding members of the...,"The Bottom Line: A refreshing, very sweet natu...",62,78
4,Ethiopia Bench Maji Geisha G1 Natural,Taster's Coffee,"New Taipei City, Taiwan","Bench-Maji Zone, southern Ethiopia",1,October 2022,93,Aroma:,9,Acidity/Structure:,...,8,Flavor:,9,Aftertaste:,8,"Blind Assessment: Gently sweet-tart, floral-to...",Notes: Produced from trees of the admired bota...,The Bottom Line: A quietly confident natural-p...,65,81


### Turning `review_date` into datetime type and extracting month and year.

In [47]:
data_df['review_date'] = pd.to_datetime(data_df['review_date'])

In [48]:
data_df['year'] = pd.DatetimeIndex(data_df['review_date']).year

In [49]:
data_df['month'] = pd.DatetimeIndex(data_df['review_date']).month

In [50]:
#drop unneeded agtron column
data_df.drop('review_date', axis=1, inplace=True)

In [51]:
data_df.head()

Unnamed: 0,coffee_name,roaster_name,roaster_location,coffee_origin,roast_level,overall_score,h1,s1,h2,s2,...,s4,h5,s5,p1,p2,p3,bean_agtron,ground_agtron,year,month
0,Colombia Cerro Azul Enano,Equator Coffees,"San Rafael, California","Trujillo, Valle del Cauca Department, Colombia",2,94,Aroma:,9,Acidity/Structure:,9,...,9,Aftertaste:,8,Blind Assessment: Elegantly fruit- and cocoa-t...,Notes: Produced at Finca Cerro Azul (also owne...,The Bottom Line: This rare Enano (dwarf Geish...,60,77,2022,10
1,Peru Incahuasi,Press Coffee,"Phoenix, Arizona","Cusco Region, Peru",2,94,Aroma:,9,Acidity/Structure:,9,...,9,Aftertaste:,8,"Blind Assessment: Gently fruit-toned, integrat...",Notes: Produced at Incahuasi Farm from trees o...,The Bottom Line: Laden with tropical fruit not...,58,78,2022,10
2,Colombia Aponte’s Guardians,Press Coffee,"Phoenix, Arizona","Nariño Department, Colombia",2,93,Aroma:,9,Acidity/Structure:,9,...,9,Aftertaste:,8,"Blind Assessment: Richly sweet, spice-toned. L...",Notes: Produced at Aponte Farm from an undiscl...,"The Bottom Line: A balanced, inviting washed C...",59,77,2022,10
3,Nicaragua Flor de Dalia Natural,Equator Coffees,"San Rafael, California","La Dalia, Matagalpa Department, northern Nicar...",2,92,Aroma:,8,Acidity/Structure:,8,...,9,Aftertaste:,8,"Blind Assessment: Gently fruit-forward, sweetl...",Notes: Produced by smallholding members of the...,"The Bottom Line: A refreshing, very sweet natu...",62,78,2022,10
4,Ethiopia Bench Maji Geisha G1 Natural,Taster's Coffee,"New Taipei City, Taiwan","Bench-Maji Zone, southern Ethiopia",1,93,Aroma:,9,Acidity/Structure:,9,...,9,Aftertaste:,8,"Blind Assessment: Gently sweet-tart, floral-to...",Notes: Produced from trees of the admired bota...,The Bottom Line: A quietly confident natural-p...,65,81,2022,10


In [52]:
#verifying modified data types are numerical
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4779 entries, 0 to 4778
Data columns (total 23 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   coffee_name       4779 non-null   object
 1   roaster_name      4779 non-null   object
 2   roaster_location  4779 non-null   object
 3   coffee_origin     4779 non-null   object
 4   roast_level       4705 non-null   Int8  
 5   overall_score     4779 non-null   int64 
 6   h1                4779 non-null   object
 7   s1                4779 non-null   int64 
 8   h2                4779 non-null   object
 9   s2                4779 non-null   int64 
 10  h3                4779 non-null   object
 11  s3                4779 non-null   int64 
 12  h4                4779 non-null   object
 13  s4                4779 non-null   int64 
 14  h5                4779 non-null   object
 15  s5                4779 non-null   int64 
 16  p1                4779 non-null   object
 17  p2            

### Turn locations into latitude and longitude

In [53]:
#using open street map to get lat and long for roaster_location

import requests
import urllib.parse

lat = []
lon = []

address_list = data_df['roaster_location'].to_numpy()

for address in address_list:
    url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(address) + '?format=json'

    try:
        response = requests.get(url).json()
        lat.append(response[0]["lat"])
        lon.append(response[0]["lon"])
    except IndexError:
        lat.append('')
        lon.append('')

print(lat)
print(lon)

['37.9735346', '33.4484367', '33.4484367', '37.9735346', '25.072134249999998', '43.074761', '24.7519538', '24.7519538', '37.7840208', '37.7840208', '25.0375198', '25.0375198', '25.0375198', '19.6238149', '32.7174202', '24.9929995', '41.1035786', '32.7174202', '44.9772995', '44.9772995', '25.072134249999998', '25.011997', '44.9772995', '25.011997', '19.557849', '19.557849', '23.4871369', '38.5810606', '23.4871369', '23.4871369', '38.5810606', '23.4871369', '23.4871369', '38.5810606', '45.7874957', '30.3466423', '45.5202471', '24.8066333', '43.6610277', '43.5569174', '32.7174202', '33.1216751', '42.291707', '43.074761', '38.7509488', '32.7174202', '43.074761', '38.7509488', '38.7509488', '38.7509488', '19.6238149', '37.2769484', '24.7519538', '32.7174202', '24.163162', '22.6828017', '22.6828017', '32.2228765', '25.0452747', '25.0452747', '43.074761', '43.074761', '19.6238149', '45.6794293', '36.9157056', '32.7174202', '32.7174202', '37.7840208', '37.7840208', '37.7840208', '25.0375198', 

In [54]:
#checking lengths to support adding into dataframe
len(lat)

4779

In [55]:
len(lon)

4779

In [56]:
len(data_df)

4779

In [57]:
#adding column to dataframe
data_df['roaster_location_lat'] = lat

In [58]:
#adding column to dataframe
data_df['roaster_location_lon'] = lon

In [59]:
#drop original roaster_location being replaced with lat and lon
data_df.drop('roaster_location', axis=1, inplace=True)

In [60]:
#spot checking accuracy of lat locations
data_df.sample(10)

Unnamed: 0,coffee_name,roaster_name,coffee_origin,roast_level,overall_score,h1,s1,h2,s2,h3,...,s5,p1,p2,p3,bean_agtron,ground_agtron,year,month,roaster_location_lat,roaster_location_lon
859,Twisted V.6 Espresso,JBC Coffee Roasters,South America; Central America; Africa,2.0,94,Aroma:,9,Body:,9,Flavor:,...,9,Blind Assessment: Evaluated as espresso. Choco...,"Notes: A blend of coffees from South America, ...",The Bottom Line: This house-blend espresso is...,54,74,2021,3,43.074761,-89.3837613
302,Twisted V.7 Espresso,JBC Coffee Roasters,Central America; South America; Ethiopia,2.0,93,Aroma:,9,Body:,8,Flavor:,...,9,Blind Assessment: Evaluated as espresso. Pista...,"Notes: A blend of coffees from South America, ...",The Bottom Line: A very appealing and affordab...,52,74,2022,3,43.074761,-89.3837613
2747,Congo Sopacdi,Red Rooster Coffee Roaster,"Kalehe, South Kivu Province, Democratic Republ...",1.0,91,Aroma:,8,Acidity:,8,Body:,...,8,Blind Assessment: Delicately sweet and spicy. ...,"Notes: Last year, this exceptional coffee was ...",Who Should Drink It: Those who appreciate a ba...,62,82,2016,2,36.9157056,-80.3876889
4562,Classic Espresso,True Beans Coffee Roasters,Not disclosed.,4.0,91,Aroma:,8,Body:,8,Flavor:,...,8,Blind Assessment: Evaluated as espresso. A ste...,Notes: Certified organically grown and Fair-Tr...,Who Should Drink It: No grand sensory gestures...,43,54,2010,10,33.7690164,-118.191604
1373,Colombia Finca La Loma Microlot,BeanFruit Coffee Co.,"Huila, Colombia",2.0,94,Aroma:,9,Acidity/Structure:,9,Body:,...,8,"Blind Assessment: High-toned, sweetly tart. Bi...",Notes: Produced at Finca La Loma of the Pink B...,The Bottom Line: An exceptional Colombia cup w...,58,76,2020,3,32.3086744,-90.1983063
1272,Instant Karma,modcup coffee,Brazil; Bali; Ethiopia,,90,Aroma:,8,Acidity/Structure:,8,Body:,...,8,Blind Assessment: Evaluated as an instant or s...,Notes: A blend of three coffees: a pulped natu...,The Bottom Line: A very appealing instant cup ...,0,0,2020,7,40.7215682,-74.047455
2451,Ethiopia Yirgacheffe,Turning Point Coffee,"Yirgacheffe growing region, southern Ethiopia",1.0,94,Aroma:,9,Acidity:,9,Body:,...,8,"Blind Assessment: Delicate, vibrantly sweet. H...",Notes: Southern Ethiopia coffees like this one...,"The Bottom Line: A crisp, delicate, tartly flo...",63,84,2016,11,37.7790262,-122.419906
883,Guatemala Cobán Chicoj Cooperative Red Honey F...,Small Eyes Cafe,"Cobán growing region, central Guatemala",2.0,94,Aroma:,9,Acidity/Structure:,9,Body:,...,8,"Blind Assessment: Vibrant, juicy, cocoa-toned....",Notes: Produced by members of the Cobán Chicoj...,The Bottom Line: A stone fruit and cocoa-drive...,58,78,2021,3,24.7519538,121.7533344
4487,Kenya Kangunu,Terroir Coffee,"Murang'a, south-central Kenya",3.0,95,Aroma:,9,Acidity:,10,Body:,...,8,Blind Assessment: Grandly pure and sweetly aci...,Notes: Affiliated with the 1340-member Kangunu...,Who Should Drink It: Those who can handle triu...,50,66,2011,1,42.4850931,-71.43284
3522,Lakeshore Blend Medium Roast,Caribou Coffee,Not disclosed.,4.0,90,Aroma:,8,Acidity:,8,Body:,...,8,"Blind Assessment: Crisp, gently bright, delica...","Notes: Like all Caribou coffees, this blend is...",Who Should Drink It: Those who enjoy a gracefu...,46,50,2013,11,44.9772995,-93.2654692


### Cleaning up country names to return more lat and long for coffee origin
There were lots of missing values when running with original data. Cleaning up origin names to improve return rate of lat and lon. Lat and long will be a bit less precise given replacing in many spots with country names.


In [61]:
data_df['coffee_origin'].unique()

array(['Trujillo, Valle del Cauca Department, Colombia',
       'Cusco Region, Peru', 'Nariño Department, Colombia', ...,
       'Fondo Paez Cooperative, Valle de Cauca Department, western Colombia.',
       'Yirgacheffe growing region, Sidamo Province, southern Ethiopia.',
       'Nilgiris, southern India.'], dtype=object)

In [62]:
data_df['coffee_origin'].value_counts()

Not disclosed.                                                           156
Yirgacheffe growing region, southern Ethiopia                            147
Yirgacheffe growing region, southern Ethiopia.                           109
Boquete growing region, western Panama                                   100
Nyeri growing region, south-central Kenya                                 99
                                                                        ... 
La Libertad, Huehuetenango, Guatemala                                      1
Cusco Department, La Convención Province, Santa Teresa District, Peru      1
San Cristobal, Cobán, Alta Verapaz, Guatemala                              1
Jamaica                                                                    1
Nilgiris, southern India.                                                  1
Name: coffee_origin, Length: 1794, dtype: int64

Using regex to cleanup origin locations:

In [63]:
data_df['coffee_origin'].replace(r'.*(Ethiopia).*', 'Ethiopia', inplace = True, regex=True)

In [64]:
data_df['coffee_origin'].replace(r'.*(Panama).*', 'Panama', inplace = True, regex=True)

In [65]:
data_df['coffee_origin'].replace(r'.*(El Salvador).*', 'El Salvador', inplace = True, regex=True)

In [66]:
data_df['coffee_origin'].replace(r'.*(Kenya).*', 'Kenya', inplace = True, regex=True)

In [67]:
data_df['coffee_origin'].replace(r'.*(Guatemala).*', 'Guatemala', inplace = True, regex=True)

In [68]:
data_df['coffee_origin'].replace(r'.*(Colombia).*', 'Colombia', inplace = True, regex=True)

In [69]:
data_df['coffee_origin'].replace(r'.*(Burundi).*', 'Burundi', inplace = True, regex=True)

In [70]:
data_df['coffee_origin'].replace(r'.*(Costa R.ca).*', 'Costa Rica', inplace = True, regex=True)

In [71]:
data_df['coffee_origin'].replace(r'.*(Honduras).*', 'Honduras', inplace = True, regex=True)

In [72]:
data_df['coffee_origin'].replace(r'.*(Rwanda).*', 'Rwanda', inplace = True, regex=True)

In [73]:
data_df['coffee_origin'].replace(r'.*(Hawaii).*', 'Hawaii', inplace = True, regex=True)

In [74]:
data_df['coffee_origin'].replace(r'.*(Hawai).*', 'Hawaii', inplace = True, regex=True)

In [75]:
data_df['coffee_origin'].replace(r'.*(Peru).*', 'Peru', inplace = True, regex=True)

In [76]:
data_df['coffee_origin'].replace(r'.*(India).*', 'India', inplace = True, regex=True)

In [77]:
data_df['coffee_origin'].replace(r'.*(Brazil).*', 'Brazil', inplace = True, regex=True)

In [78]:
data_df['coffee_origin'].replace(r'.*(Ecuador).*', 'Ecuador', inplace = True, regex=True)

In [79]:
data_df['coffee_origin'].replace(r'.*(Haiti).*', 'Haiti', inplace = True, regex=True)

In [80]:
data_df['coffee_origin'].replace(r'.*(Puerto Rico).*', 'Puerto Rico', inplace = True, regex=True)

In [81]:
data_df['coffee_origin'].replace(r'.*(Uganda).*', 'Uganda', inplace = True, regex=True)

In [82]:
data_df['coffee_origin'].replace(r'.*(Nicaragua).*', 'Nicaragua', inplace = True, regex=True)

In [83]:
data_df['coffee_origin'].replace(r'.*(Papua New Guinea).*', 'Papua New Guinea', inplace = True, regex=True)

In [84]:
data_df['coffee_origin'].replace(r'.*(Thailand).*', 'Thailand', inplace = True, regex=True)

In [85]:
data_df['coffee_origin'].replace(r'.*(Indonesia).*', 'Indonesia', inplace = True, regex=True)

In [86]:
data_df['coffee_origin'].replace(r'.*(Yemen).*', 'Yemen', inplace = True, regex=True)

In [87]:
data_df['coffee_origin'].replace(r'.*(Jamaica).*', 'Jamaica', inplace = True, regex=True)

In [88]:
data_df['coffee_origin'].replace(r'.*(Tanzania).*', 'Tanzania', inplace = True, regex=True)

In [89]:
data_df['coffee_origin'].replace(r'.*(Mexico).*', 'Mexico', inplace = True, regex=True)

In [90]:
data_df['coffee_origin'].replace(r'.*(Central Zambia).*', 'Central Zambia', inplace = True, regex=True)

In [91]:
data_df['coffee_origin'].replace(r'.*(Taiwan).*', 'Taiwan', inplace = True, regex=True)

In [92]:
data_df['coffee_origin'].replace(r'.*(Democratic Republic of the Congo).*', 'Democratic Republic of the Congo', inplace = True, regex=True)

In [93]:
data_df['coffee_origin'].replace(r'.*(Bolivia).*', 'Bolivia', inplace = True, regex=True)

In [94]:
data_df['coffee_origin'].replace(r'.*(Dominican Republic).*', 'Dominican Republic', inplace = True, regex=True)

In [95]:
data_df['coffee_origin'].replace(r'.*(China).*', 'China', inplace = True, regex=True)

In [96]:
data_df['coffee_origin'].replace(r'.*(Zimbabwe).*', 'Zimbabwe', inplace = True, regex=True)

In [97]:
data_df['coffee_origin'].replace(r'.*(Philippines).*', 'The Philippines', inplace = True, regex=True)

In [98]:
data_df['coffee_origin'].replace(r'.*(Malawi).*', 'Malawi', inplace = True, regex=True)

In [99]:
data_df['coffee_origin'].replace(r'.*(Vietnam).*', 'Vietnam', inplace = True, regex=True)

In [100]:
data_df['coffee_origin'].replace(r'.*(Zambia).*', 'Zambia', inplace = True, regex=True)

In [101]:
data_df['coffee_origin'].replace(r'.*(Malaysia).*', 'Malaysia', inplace = True, regex=True)

In [102]:
data_df['coffee_origin'].replace(r'.*(Democratic Republic of Congo).*', 'Democratic Republic of the Congo', inplace = True, regex=True)

In [103]:
data_df['coffee_origin'].replace(r'.*(Democratic Republic of Congo).*', 'Democratic Republic of the Congo', inplace = True, regex=True)

In [104]:
data_df['coffee_origin'].replace(r'.*(Laos).*', 'Laos', inplace = True, regex=True)

In [105]:
data_df['coffee_origin'].replace(r'.*(Myanmar).*', 'Myanmar', inplace = True, regex=True)

In [106]:
data_df['coffee_origin'].replace(r'.*(DRC Congo).*', 'Democratic Republic of the Congo', inplace = True, regex=True)

In [107]:
data_df['coffee_origin'].replace(r'.*(California).*', 'California', inplace = True, regex=True)

In [109]:
#getting lat and lon for coffee_origin locations

lat_origin = []
lon_origin = []

address_list_origin = data_df['coffee_origin'].to_numpy()

for address in address_list_origin:
    url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(str(address)) +'?format=json'

    try:
        response = requests.get(url).json()
        lat_origin.append(response[0]["lat"])
        lon_origin.append(response[0]["lon"])
    except IndexError:
        lat_origin.append('')
        lon_origin.append('')

print(lat_origin)
print(lon_origin)

['4.099917', '-6.8699697', '4.099917', '12.6090157', '10.2116702', '-1.3397668', '10.2116702', '10.2116702', '1.4419683', '', '1.4419683', '10.2116702', '10.2116702', '19.593801499999998', '1.4419683', '23.9739374', '10.2116702', '1.4419683', '19.593801499999998', '14.8971921', '1.4419683', '-1.3397668', '15.9266657', '1.4419683', '19.593801499999998', '19.593801499999998', '1.4419683', '10.2116702', '10.2116702', '10.2116702', '-1.9646631', '15.5855545', '15.5855545', '15.5855545', '23.6585116', '23.6585116', '23.6585116', '23.6585116', '23.6585116', '23.6585116', '23.6585116', '23.6585116', '23.6585116', '1.4419683', '4.099917', '4.099917', '-6.8699697', '15.2572432', '4.099917', '19.1399952', '19.593801499999998', '-1.9646631', '10.2116702', '4.099917', '23.9739374', '10.2116702', '10.2116702', '10.2116702', '1.4419683', '-6.8699697', '23.6585116', '15.5855545', '19.593801499999998', '12.6090157', '10.2116702', '4.099917', '10.2116702', '4.099917', '4.099917', '-6.8699697', '-2.4833

In [110]:
#adding column to dataframe
data_df['origin_lat'] = lat_origin

In [111]:
#adding column to dataframe
data_df['origin_lon'] = lon_origin

In [112]:
#drop unneeded agtron column
data_df.drop('coffee_origin', axis=1, inplace=True)

In [113]:
#verifying accuracy of column adds
data_df.sample(5)

Unnamed: 0,coffee_name,roaster_name,roast_level,overall_score,h1,s1,h2,s2,h3,s3,...,p2,p3,bean_agtron,ground_agtron,year,month,roaster_location_lat,roaster_location_lon,origin_lat,origin_lon
3464,Ganesha Espresso (reviewed for drip applications),Tony's Coffees & Teas,3,92,Aroma:,8,Acidity:,7,Body:,9,...,Notes: Ganesha Espresso is Tony’s flagship esp...,Who Should Drink It: Those who prefer depth an...,51,62,2014,2,48.7544012,-122.478836,,
2639,San Ysidro Costa Rica,Axil Coffee Roasters,1,93,Aroma:,9,Acidity:,8,Body:,9,...,Notes: Produced from the Caturra and Catuai va...,"The Bottom Line: A flavor-saturated, floral-to...",68,80,2016,6,-37.8244246,145.0317207,10.2735633,-84.0739102
3118,Elevasio (Single-Serve Capsule),Nespresso,6,88,Aroma:,8,Acidity:,7,Body:,8,...,Notes: This coffee was produced in Switzerland...,Who Should Drink It: Those who enjoy the stron...,0,58,2015,2,46.5218269,6.6327025,4.099917,-72.9088133
2282,Duomo Espresso Northern Italian Style,Black Oak Coffee Roasters,4,91,Aroma:,9,Acidity:,8,Body:,8,...,Notes: A blend designed to reflect the norther...,"The Bottom Line: A quiet, accessible darker-ro...",45,55,2017,8,39.1501662,-123.2077861,,
488,Ethiopia Worka Sakaro Anaerobic Natural,Red Rock Roasters,3,94,Aroma:,9,Acidity/Structure:,9,Body:,9,...,Notes: Southern Ethiopia coffees like this one...,"The Bottom Line: A chocolaty, elegantly bitter...",56,64,2021,11,35.21287095,-106.71324849574629,10.2116702,38.6521203


In [114]:
#checking data types
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4779 entries, 0 to 4778
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   coffee_name           4779 non-null   object
 1   roaster_name          4779 non-null   object
 2   roast_level           4705 non-null   Int8  
 3   overall_score         4779 non-null   int64 
 4   h1                    4779 non-null   object
 5   s1                    4779 non-null   int64 
 6   h2                    4779 non-null   object
 7   s2                    4779 non-null   int64 
 8   h3                    4779 non-null   object
 9   s3                    4779 non-null   int64 
 10  h4                    4779 non-null   object
 11  s4                    4779 non-null   int64 
 12  h5                    4779 non-null   object
 13  s5                    4779 non-null   int64 
 14  p1                    4779 non-null   object
 15  p2                    4779 non-null   

In [115]:
data_df['origin_lat'].value_counts()

10.2116702             1261
1.4419683               479
4.099917                431
15.5855545              323
                        266
8.559559                240
-2.4833826              229
10.2735633              221
19.593801499999998      197
13.8000382              148
-1.9646631              118
-10.3333333             112
15.2572432               83
-6.8699697               82
12.6090157               71
-3.426449                63
23.6585116               52
-1.3397668               47
14.8971921               45
16.3471243               37
-5.6816069               37
-17.0568696              37
-6.5247123               32
-2.9814344               28
23.9739374               27
51.5086102               11
19.1399952               10
19.0974031                9
21.417531150000002        8
18.1850507                8
1.5333554                 7
22.3511148                6
-14.5189121               6
35.000074                 5
-25.746194000000003       4
-30.29284845        

In [116]:
#turning blank into nans
data_df['roaster_location_lat'].replace('', np.nan, inplace=True)

In [117]:
data_df['roaster_location_lat'] = data_df['roaster_location_lat'].astype(float)

In [118]:
#turning blank into nans
data_df['roaster_location_lon'].replace('', np.nan, inplace=True)

In [119]:
data_df['roaster_location_lon'] = data_df['roaster_location_lon'].astype(float)

In [120]:
#turning blank into nans
data_df['origin_lat'].replace('', np.nan, inplace=True)

In [121]:
data_df['origin_lat'] = data_df['origin_lat'].astype(float)

In [122]:
#turning blank into nans
data_df['origin_lon'].replace('', np.nan, inplace=True)

In [123]:
data_df['origin_lon'] = data_df['origin_lon'].astype(float)

In [124]:
data_df.head()

Unnamed: 0,coffee_name,roaster_name,roast_level,overall_score,h1,s1,h2,s2,h3,s3,...,p2,p3,bean_agtron,ground_agtron,year,month,roaster_location_lat,roaster_location_lon,origin_lat,origin_lon
0,Colombia Cerro Azul Enano,Equator Coffees,2,94,Aroma:,9,Acidity/Structure:,9,Body:,9,...,Notes: Produced at Finca Cerro Azul (also owne...,The Bottom Line: This rare Enano (dwarf Geish...,60,77,2022,10,37.973535,-122.531087,4.099917,-72.908813
1,Peru Incahuasi,Press Coffee,2,94,Aroma:,9,Acidity/Structure:,9,Body:,9,...,Notes: Produced at Incahuasi Farm from trees o...,The Bottom Line: Laden with tropical fruit not...,58,78,2022,10,33.448437,-112.074141,-6.86997,-75.045851
2,Colombia Aponte’s Guardians,Press Coffee,2,93,Aroma:,9,Acidity/Structure:,9,Body:,8,...,Notes: Produced at Aponte Farm from an undiscl...,"The Bottom Line: A balanced, inviting washed C...",59,77,2022,10,33.448437,-112.074141,4.099917,-72.908813
3,Nicaragua Flor de Dalia Natural,Equator Coffees,2,92,Aroma:,8,Acidity/Structure:,8,Body:,9,...,Notes: Produced by smallholding members of the...,"The Bottom Line: A refreshing, very sweet natu...",62,78,2022,10,37.973535,-122.531087,12.609016,-85.293691
4,Ethiopia Bench Maji Geisha G1 Natural,Taster's Coffee,1,93,Aroma:,9,Acidity/Structure:,9,Body:,8,...,Notes: Produced from trees of the admired bota...,The Bottom Line: A quietly confident natural-p...,65,81,2022,10,25.072134,121.679919,10.21167,38.65212


In [125]:
#checking data type updates
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4779 entries, 0 to 4778
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   coffee_name           4779 non-null   object 
 1   roaster_name          4779 non-null   object 
 2   roast_level           4705 non-null   Int8   
 3   overall_score         4779 non-null   int64  
 4   h1                    4779 non-null   object 
 5   s1                    4779 non-null   int64  
 6   h2                    4779 non-null   object 
 7   s2                    4779 non-null   int64  
 8   h3                    4779 non-null   object 
 9   s3                    4779 non-null   int64  
 10  h4                    4779 non-null   object 
 11  s4                    4779 non-null   int64  
 12  h5                    4779 non-null   object 
 13  s5                    4779 non-null   int64  
 14  p1                    4779 non-null   object 
 15  p2                   

### Exploring Header Values to Prep for Sorting them Out

In [126]:
data_df['h1'].unique()

array(['Aroma:', 'Acidity/Structure:', 'Acidity:'], dtype=object)

In [127]:
data_df['h1'].value_counts()

Aroma:                4743
Acidity/Structure:      24
Acidity:                12
Name: h1, dtype: int64

In [128]:
#turning NA into nans
data_df['h1'].replace(r'.*(Structure).*', 'Acidity:', inplace = True, regex=True)

In [129]:
data_df['h1'].value_counts()

Aroma:      4743
Acidity:      36
Name: h1, dtype: int64

In [130]:
data_df['h2'].unique()

array(['Acidity/Structure:', 'Body:', 'Acidity:'], dtype=object)

In [131]:
#turning NA into nans
data_df['h2'].replace(r'.*(Structure).*', 'Acidity:', inplace = True, regex=True)

In [132]:
data_df['h2'].value_counts()

Acidity:    4013
Body:        766
Name: h2, dtype: int64

In [133]:
data_df['h3'].unique()

array(['Body:', 'Flavor:'], dtype=object)

In [134]:
data_df['h3'].value_counts()

Body:      4013
Flavor:     766
Name: h3, dtype: int64

In [135]:
data_df['h4'].unique()

array(['Flavor:', 'Aftertaste:'], dtype=object)

In [136]:
data_df['h4'].value_counts()

Flavor:        4013
Aftertaste:     766
Name: h4, dtype: int64

In [137]:
data_df['h5'].unique()

array(['Aftertaste:', 'With Milk:'], dtype=object)

In [138]:
data_df['h5'].value_counts()

Aftertaste:    4013
With Milk:      766
Name: h5, dtype: int64

**Summary of Header Findings**
 
*Six Options with Counts per Header*
 
- Aroma - 4743 h1
- Acidity - 36 h1, 4013 h2
- Body - 766 h2, 4013 h3
- Flavor - 766 h3, 4013 h4
- Aftertaste - 766 h4, 4013 h5
- With Milk - 766 h5

In [139]:
aroma_count = 4743
aroma_count

4743

In [140]:
acidity_count = 36+4013
acidity_count

4049

In [141]:
body_count = 766+4013
body_count

4779

In [142]:
flavor_count = 766+4013
flavor_count

4779

In [143]:
aftertaste_count = 766+4013
aftertaste_count

4779

In [144]:
with_milk = 766
with_milk

766

In [145]:
data_df.head()

Unnamed: 0,coffee_name,roaster_name,roast_level,overall_score,h1,s1,h2,s2,h3,s3,...,p2,p3,bean_agtron,ground_agtron,year,month,roaster_location_lat,roaster_location_lon,origin_lat,origin_lon
0,Colombia Cerro Azul Enano,Equator Coffees,2,94,Aroma:,9,Acidity:,9,Body:,9,...,Notes: Produced at Finca Cerro Azul (also owne...,The Bottom Line: This rare Enano (dwarf Geish...,60,77,2022,10,37.973535,-122.531087,4.099917,-72.908813
1,Peru Incahuasi,Press Coffee,2,94,Aroma:,9,Acidity:,9,Body:,9,...,Notes: Produced at Incahuasi Farm from trees o...,The Bottom Line: Laden with tropical fruit not...,58,78,2022,10,33.448437,-112.074141,-6.86997,-75.045851
2,Colombia Aponte’s Guardians,Press Coffee,2,93,Aroma:,9,Acidity:,9,Body:,8,...,Notes: Produced at Aponte Farm from an undiscl...,"The Bottom Line: A balanced, inviting washed C...",59,77,2022,10,33.448437,-112.074141,4.099917,-72.908813
3,Nicaragua Flor de Dalia Natural,Equator Coffees,2,92,Aroma:,8,Acidity:,8,Body:,9,...,Notes: Produced by smallholding members of the...,"The Bottom Line: A refreshing, very sweet natu...",62,78,2022,10,37.973535,-122.531087,12.609016,-85.293691
4,Ethiopia Bench Maji Geisha G1 Natural,Taster's Coffee,1,93,Aroma:,9,Acidity:,9,Body:,8,...,Notes: Produced from trees of the admired bota...,The Bottom Line: A quietly confident natural-p...,65,81,2022,10,25.072134,121.679919,10.21167,38.65212


In [146]:
#creating placeholder for aroma scores
data_df['aroma_hold'] = data_df['h1'] + data_df['s1'].map(str)

In [147]:
#creating placeholder for acidity scores
data_df['acidity_hold'] = data_df['h1'] + data_df['s1'].map(str)

In [148]:
#removing acidity from aroma column
data_df['aroma_hold'].replace(r'.*(cidit).*', np.nan, inplace = True, regex=True)

In [149]:
#checking counts
data_df['aroma_hold'].value_counts()

Aroma:9     3194
Aroma:8     1301
Aroma:7      119
Aroma:10      95
Aroma:6       15
Aroma:5        9
Aroma:4        5
Aroma:2        3
Aroma:3        2
Name: aroma_hold, dtype: int64

In [150]:
#extracting aroma ratings
data_df['aroma'] = data_df['aroma_hold'].str.extract(r'([0-9]+)')

In [151]:
#verifying the correct number of aroma ratings are remaining
data_df['aroma'].value_counts().sum()

4743

In [152]:
#drop unneeded agtron column
data_df.drop('aroma_hold', axis=1, inplace=True)

In [153]:
#checking counts
data_df['acidity_hold'].value_counts()

Aroma:9      3194
Aroma:8      1301
Aroma:7       119
Aroma:10       95
Acidity:8      21
Aroma:6        15
Acidity:9      13
Aroma:5         9
Aroma:4         5
Aroma:2         3
Aroma:3         2
Acidity:7       2
Name: acidity_hold, dtype: int64

In [154]:
#removing acidity from aroma column
data_df['acidity_hold'].replace(r'.*(roma).*', np.nan, inplace = True, regex=True)

In [155]:
#creating placeholder for aroma scores
data_df['acidity2_hold'] = data_df['h2'] + data_df['s2'].map(str)

In [156]:
#checking counts
data_df['acidity2_hold'].value_counts()

Acidity:8     2028
Acidity:9     1639
Body:8         409
Body:9         313
Acidity:7      285
Body:7          42
Acidity:6       26
Acidity:10      15
Acidity:4        6
Acidity:3        5
Acidity:5        5
Body:6           2
Acidity:1        2
Acidity:2        2
Name: acidity2_hold, dtype: int64

In [157]:
#removing acidity from aroma column
data_df['acidity2_hold'].replace(r'.*(ody).*', np.nan, inplace = True, regex=True)

In [158]:
#creating placeholder for aroma scores
data_df['acidity_combo'] = data_df['acidity_hold'].map(str) + ' ' + data_df['acidity2_hold'].map(str)

In [159]:
data_df['acidity_combo'].value_counts()

nan Acidity:8     2028
nan Acidity:9     1639
nan nan            730
nan Acidity:7      285
nan Acidity:6       26
Acidity:8 nan       21
nan Acidity:10      15
Acidity:9 nan       13
nan Acidity:4        6
nan Acidity:3        5
nan Acidity:5        5
nan Acidity:1        2
nan Acidity:2        2
Acidity:7 nan        2
Name: acidity_combo, dtype: int64

In [160]:
#extracting aroma ratings
data_df['acidity'] = data_df['acidity_combo'].str.extract(r'([0-9]+)')

In [161]:
#verifying updates
data_df['acidity'].value_counts()

8     2049
9     1652
7      287
6       26
10      15
4        6
3        5
5        5
1        2
2        2
Name: acidity, dtype: int64

In [162]:
#drop unneeded columns
data_df.drop(['acidity_combo', 'acidity2_hold', 'acidity_hold'], axis=1, inplace=True)

In [163]:
#creating placeholder for body scores
data_df['body_hold'] = data_df['h2'] + data_df['s2'].map(str)

In [164]:
#removing acidity from body column
data_df['body_hold'].replace(r'.*(cidit).*', np.nan, inplace = True, regex=True)

In [165]:
#creating 2nd placeholder for body scores
data_df['body2_hold'] = data_df['h3'] + data_df['s3'].map(str)

In [166]:
#combining body scores
data_df['body_combo'] = data_df['body_hold'].map(str) + ' ' + data_df['body2_hold'].map(str)

In [167]:
#extracting body ratings
data_df['body'] = data_df['body_combo'].str.extract(r'([0-9]+)')

In [168]:
#drop unneeded columns
data_df.drop(['body_combo', 'body2_hold', 'body_hold'], axis=1, inplace=True)

In [169]:
#creating placeholder for flavor scores
data_df['flavor_hold'] = data_df['h3'] + data_df['s3'].map(str)

In [170]:
#creating placeholder for 2nd flavor scores
data_df['flavor2_hold'] = data_df['h4'] + data_df['s4'].map(str)

In [171]:
#removing body from flavor column
data_df['flavor_hold'].replace(r'.*(ody).*', np.nan, inplace = True, regex=True)

In [172]:
#removing body from flavor column
data_df['flavor2_hold'].replace(r'.*(ftertas).*', np.nan, inplace = True, regex=True)

In [173]:
#combining flavor scores
data_df['flavor_combo'] = data_df['flavor_hold'].map(str) + ' ' + data_df['flavor2_hold'].map(str)

In [174]:
#extracting body ratings
data_df['flavor'] = data_df['flavor_combo'].str.extract(r'([0-9]+)')

In [175]:
data_df['flavor'].value_counts().sum()

4779

In [176]:
#drop unneeded columns
data_df.drop(['flavor_combo', 'flavor2_hold', 'flavor_hold'], axis=1, inplace=True)

In [177]:
#creating placeholder for aftertaste scores
data_df['aftertaste_hold'] = data_df['h4'] + data_df['s4'].map(str)

In [178]:
#creating placeholder for aftertaste scores
data_df['aftertaste2_hold'] = data_df['h5'] + data_df['s5'].map(str)

In [179]:
#removing flavor from aftertaste column
data_df['aftertaste_hold'].replace(r'.*(lavo).*', np.nan, inplace = True, regex=True)

In [180]:
#removing with milk from aftertaste column
data_df['aftertaste2_hold'].replace(r'.*(ilk).*', np.nan, inplace = True, regex=True)

In [181]:
#combining flavor scores
data_df['aftertaste_combo'] = data_df['aftertaste_hold'].map(str) + ' ' + data_df['aftertaste2_hold'].map(str)

In [182]:
#extracting body ratings
data_df['aftertaste'] = data_df['aftertaste_combo'].str.extract(r'([0-9]+)')

In [183]:
#drop unneeded columns
data_df.drop(['aftertaste_combo', 'aftertaste2_hold', 'aftertaste_hold'], axis=1, inplace=True)

In [184]:
#creating placeholder for with milk scores
data_df['with_milk_hold'] = data_df['h5'] + data_df['s5'].map(str)

In [185]:
#removing flavor from aftertaste column
data_df['with_milk_hold'].replace(r'.*(ftertas).*', np.nan, inplace = True, regex=True)

In [186]:
#extracting body ratings
data_df['with_milk'] = data_df['with_milk_hold'].str.extract(r'([0-9]+)')

In [187]:
#drop unneeded columns
data_df.drop('with_milk_hold', axis=1, inplace=True)

In [188]:
#drop unneeded columns
data_df.drop(['h1', 's1', 'h2', 's2', 'h3', 's3', 'h4', 's4', 'h5', 's5'], axis=1, inplace=True)

In [189]:
data_df.head()

Unnamed: 0,coffee_name,roaster_name,roast_level,overall_score,p1,p2,p3,bean_agtron,ground_agtron,year,...,roaster_location_lat,roaster_location_lon,origin_lat,origin_lon,aroma,acidity,body,flavor,aftertaste,with_milk
0,Colombia Cerro Azul Enano,Equator Coffees,2,94,Blind Assessment: Elegantly fruit- and cocoa-t...,Notes: Produced at Finca Cerro Azul (also owne...,The Bottom Line: This rare Enano (dwarf Geish...,60,77,2022,...,37.973535,-122.531087,4.099917,-72.908813,9,9,9,9,8,
1,Peru Incahuasi,Press Coffee,2,94,"Blind Assessment: Gently fruit-toned, integrat...",Notes: Produced at Incahuasi Farm from trees o...,The Bottom Line: Laden with tropical fruit not...,58,78,2022,...,33.448437,-112.074141,-6.86997,-75.045851,9,9,9,9,8,
2,Colombia Aponte’s Guardians,Press Coffee,2,93,"Blind Assessment: Richly sweet, spice-toned. L...",Notes: Produced at Aponte Farm from an undiscl...,"The Bottom Line: A balanced, inviting washed C...",59,77,2022,...,33.448437,-112.074141,4.099917,-72.908813,9,9,8,9,8,
3,Nicaragua Flor de Dalia Natural,Equator Coffees,2,92,"Blind Assessment: Gently fruit-forward, sweetl...",Notes: Produced by smallholding members of the...,"The Bottom Line: A refreshing, very sweet natu...",62,78,2022,...,37.973535,-122.531087,12.609016,-85.293691,8,8,9,9,8,
4,Ethiopia Bench Maji Geisha G1 Natural,Taster's Coffee,1,93,"Blind Assessment: Gently sweet-tart, floral-to...",Notes: Produced from trees of the admired bota...,The Bottom Line: A quietly confident natural-p...,65,81,2022,...,25.072134,121.679919,10.21167,38.65212,9,9,8,9,8,


In [190]:
#updating cleaned score value data types to integer
data_df['aroma'] = data_df['aroma'].astype('Int8')

In [191]:
data_df['acidity'] = data_df['acidity'].astype('Int8')

In [192]:
data_df['body'] = data_df['body'].astype('Int8')

In [193]:
data_df['flavor'] = data_df['flavor'].astype('Int8')

In [194]:
data_df['aftertaste'] = data_df['aftertaste'].astype('Int8')

In [195]:
data_df['with_milk'] = data_df['with_milk'].astype('Int8')

In [196]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4779 entries, 0 to 4778
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   coffee_name           4779 non-null   object 
 1   roaster_name          4779 non-null   object 
 2   roast_level           4705 non-null   Int8   
 3   overall_score         4779 non-null   int64  
 4   p1                    4779 non-null   object 
 5   p2                    4779 non-null   object 
 6   p3                    4773 non-null   object 
 7   bean_agtron           4742 non-null   Int8   
 8   ground_agtron         4743 non-null   Int8   
 9   year                  4779 non-null   int64  
 10  month                 4779 non-null   int64  
 11  roaster_location_lat  4738 non-null   float64
 12  roaster_location_lon  4738 non-null   float64
 13  origin_lat            4513 non-null   float64
 14  origin_lon            4513 non-null   float64
 15  aroma                

In [197]:
data_df.to_csv('1st_clean_data_df.csv', index=False)

## See "Capstone Project Part 3" Notebook for Next Steps