In [1332]:
### use this as template to clean second chunk of data
## continue with lat and lon
#add commentary - if redoing would sort out columns first, then combine and clean up datatypes

# Capstone Project Part 3: Cleaning "Older_Data_DF"

**Authur:** Kate Meredith  

**Date:** September-November 2022

**Notebook #:** 3 of

## Background

**Source:** Data was collected from [CoffeeReview.com](https://www.coffeereview.com/) and grouped into two DataFrames for cleaning. See "Capstone Project Part 1" for more information on scraping. See notebook on cleaning "older_data_df" for cleaning of that portion of data.

**Initial Data Overview:** The following outlines the column headers for the data initially scraped and corresponding clean up plan:

Independent Variables:
- `coffee_name` 
    - def: name of coffee reviewed
    - options: use text vectoring to tokenize or drop; one-hot encoding likely impractical given number of variations in names
- `roaster_name` 
    - def: name of roaster that roasted the coffee
    - options: one-hot encoding or drop
- `roaster_location` 
    - def: location of coffee roaster
    - turn into latitude and longitude
- `x1`
    - def: placedholder for header, expect to see `coffee_origin`
    - these headers will become column headers in the DataFrame; webpage format was inconsistent, so pulled headers in one column, followed by score in next so that scores can be correctly attributed to the right header
- `coffee_origin` 
    - def: location of coffee bean origination
    - turn into latitude and longitude
- `x2`
    - def: placedholder for header, expect to see `roast_level`
    - these headers will become column headers in the DataFrame; webpage format was inconsistent, so pulled headers in one column, followed by score in next so that scores can be correctly attributed to the right header- `roast_level` 
    - def: describes how long and thoroughly the beans are cooked, ranging from light to dark
    - options: encode numerically (ordinal values) or drop; agtron scores measure roast level with number so may drop this
- `x3`
    - def: placedholder for header, expect to see `agtron`
    - these headers will become column headers in the DataFrame; webpage format was inconsistent, so pulled headers in one column, followed by score in next so that scores can be correctly attributed to the right header
- `agtron` 
    - def: numerical measure of how roasted the beans are; first number is taken from measuring the whole beans, second from the grounds
    - options: split into two columns, one for `bean_agtron`, one for `ground_agtron`
- `x4`
    - def: placedholder for header, expect to see `est_price`
    - these headers will become column headers in the DataFrame; webpage format was inconsistent, so pulled headers in one column, followed by score in next so that scores can be correctly attributed to the right header
- `est_price`
    - def: est cost of the coffee
    - options: clean up and turn into measure that can be compared (such as USD cost per ounce) or drop; dropping may be necessary for sake of time given how complex clean up is (different measurement amounts and inflation affecting cost (data spans 1997 to 2022)
- `h1`
    - def: placedholder for first header
    - these headers will become column headers in the DataFrame; webpage format was inconsistent, so pulled headers in one column, followed by score in next so that scores can be correctly attributed to the right header
- `s1`
    - def: first score listed on webpage
    - webpage format was inconsistent; need to verify which header the score corresponds to and move to the appropriate new column
- `h2`
    - def: placedholder for second header
    - see header above
- `s2`
    - def: second score listed on webpage
    - see above
- `h3`
    - def: placedholder for third header
    - see header above
- `s3`
    - def: third score listed on webpage
    - see above
- `p1`
    - def: first paragraph scraped from webpage
    - verify if paragraphs all report same type of information; then use text vecotrizing to tokenize
- `p2`
    - def: second paragraph scraped from webpage
    - verify if paragraphs all report same type of information; then use text vecotrizing to tokenize
- `p3`
    - def: third paragraph scraped from webpage
    - verify if paragraphs all report same type of information; then use text vecotrizing to tokenize; expect there to be some difference in type of content in paragraph 3 as format changed over the years

Target Variable:
- `overall_score`
    - def: overall score awarded to coffee
    - verify data type is numeric, should not need altering otherwise
    
Note: Unlike the first set, this set does not have results for `h4`, `s4`, `h5`, or `s5`.

## References

- Referenced this [article](https://www.geeksforgeeks.org/split-a-text-column-into-two-columns-in-pandas-dataframe/) for splitting columns and keeping in dataframe 
- For [splitting columns with nans](https://stackoverflow.com/questions/69354795/how-to-skip-nan-values-when-splitting-up-a-column)
- References for [cleaning up measurement columns](https://www.geeksforgeeks.org/how-to-replace-values-in-column-based-on-condition-in-pandas/)
- Used [Regex 101](https://regex101.com/) to verify regex replacements
- Used these articles to address turning datatypes to integers with nan present:
    - [Changing to nans](https://stackoverflow.com/questions/34794067/how-to-set-a-cell-to-nan-in-a-pandas-dataframe)
    - [Using numpy to allow nans in integers](https://stackoverflow.com/questions/21287624/convert-pandas-column-containing-nans-to-dtype-int)
- Used this stack overflow [article](https://stackoverflow.com/questions/25888396/how-to-get-latitude-longitude-with-python) to acquire lat and long for locations
- Used this [article](https://stackoverflow.com/questions/55078654/troubelshooting-raise-typeerrorquote-from-bytes-expected-bytes) to address "quote from bytes" error when adding lat and lon.

In [1333]:
#Importing libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [1334]:
#import first dataset
older_df = pd.read_csv('older_data_df.csv')

In [1335]:
older_df.shape

(1788, 21)

In [1336]:
older_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1788 entries, 0 to 1787
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   coffee_name       1788 non-null   object
 1   roaster_name      1788 non-null   object
 2   overall_score     1788 non-null   object
 3   roaster_location  1786 non-null   object
 4   x1                1788 non-null   object
 5   coffee_origin     1566 non-null   object
 6   x2                1788 non-null   object
 7   roast_level       1732 non-null   object
 8   x3                1788 non-null   object
 9   agtron            1788 non-null   object
 10  x4                1788 non-null   object
 11  est_price         1727 non-null   object
 12  h1                1788 non-null   object
 13  s1                1785 non-null   object
 14  h2                1788 non-null   object
 15  s2                1783 non-null   object
 16  h3                1788 non-null   object
 17  s3            

In [1337]:
older_df.head()

Unnamed: 0,coffee_name,roaster_name,overall_score,roaster_location,x1,coffee_origin,x2,roast_level,x3,agtron,...,est_price,h1,s1,h2,s2,h3,s3,p1,p2,p3
0,Doi Chaang Wild Civet Coffee,Doi Chaang Coffee,90,"Calgary, Alberta, Canada",Coffee Origin:,Northern Thailand.,Roast Level:,Medium-Light,Agtron:,49/80,...,June 2009,Aroma:,8,Acidity:,8,Body:,7\t\t\t\t\t\t\t,Blind Assessment: Intriguing mid-tones through...,Notes: Doi Chaang is a single-estate coffee pr...,Who Should Drink It: Culinary adventurers who ...
1,Kenya AA Lenana,Après Coffee,92,"Lancaster, Pennsylvania",Coffee Origin:,South-central Kenya,Roast Level:,Medium-Dark,Agtron:,42/53,...,June 2009,Aroma:,8,Acidity:,8,Body:,7\t\t\t\t\t\t\t,"Blind Assessment: Rich, very intense aroma: da...",Notes: Despite stresses brought on by social u...,Who Should Drink It: Those who prefer understa...
2,Mele 100% Kona Coffee,Hula Daddy,92,"Holualoa, Hawaii",Coffee Origin:,"Holualoa, North Kona growing district, Hawaii.",Roast Level:,Medium-Light,Agtron:,51/73,...,June 2009,Aroma:,8,Acidity:,8,Body:,7\t\t\t\t\t\t\t,"Blind Assessment: An exciting, rather unusual ...",Notes: A blend of coffees processed by a varie...,Who Should Drink It: An exhilarating sensory r...
3,Kenya Peaberry Thika Gethumbwini,JBC Coffee Roasters,96,"Madison, Wisconsin",Coffee Origin:,South-central Kenya,Roast Level:,Light,Agtron:,57/90,...,June 2009,Aroma:,8,Acidity:,9,Body:,8\t\t\t\t\t\t\t,"Blind Assessment: Clean, complex, impeccable. ...",Notes: Despite national coffee leadership mark...,Who Should Drink It: Strikingly complete expre...
4,Kenya Gititu Peaberry,Atomic Cafe Coffee Roasters,92,"Beverly, Massachusetts",Coffee Origin:,South-central Kenya,Roast Level:,Medium,Agtron:,44/63,...,June 2009,Aroma:,8,Acidity:,8,Body:,8\t\t\t\t\t\t\t,"Blind Assessment: Rich, complex fruit aroma, i...",Notes: Despite stresses brought on by social u...,Who Should Drink It: Lovers of sweet fruit fla...


### Checking Values in Each Column

In [1338]:
#coffee name value counts
older_df['coffee_name'].value_counts()

Sumatra Mandheling                        18
Kenya AA                                  16
Breakfast Blend                           15
Ethiopia Yirgacheffe                      14
Sumatra                                   12
                                          ..
Detroit St. Decaf                          1
RayJen Roasters Reserve Decaf Espresso     1
PT’s Decaf Espresso Blend                  1
Dancing Goats Decaf                        1
Colombian Blend                            1
Name: coffee_name, Length: 1428, dtype: int64

In [1339]:
#roaster name value counts
older_df['roaster_name'].value_counts()

Green Mountain Coffee         93
Paradise Roasters             42
The Roasterie                 42
Terroir Coffee                40
PT's Coffee Roasting Co.      40
                              ..
Santa Cruz Coffee Roasting     1
Cafe Bello                     1
Kona Premium Coffee            1
North Pole Coffee Roasting     1
Greenwich Mountain Estate      1
Name: roaster_name, Length: 581, dtype: int64

In [1340]:
#roast location value counts
older_df['roaster_location'].value_counts()

Seattle, Washington         103
Waterbury, Vermont          100
Acton, Massachusetts         47
Kansas City, Missouri        39
Chicago, Illinois            38
                           ... 
Sturgeon Bay, Wisconsin       1
Monrovia, California          1
Canton, Massachusetts         1
New Salem, Massachusetts      1
Duluth, Minnesota             1
Name: roaster_location, Length: 430, dtype: int64

In [1341]:
#x1 value counts
older_df['x1'].value_counts()

Coffee Origin:    1309
Roast Level:       479
Name: x1, dtype: int64

In [1342]:
#coffee origin value counts
older_df['coffee_origin'].value_counts()

Not disclosed.                                                       178
Medium-Dark                                                           99
Not disclosed                                                         78
Dark                                                                  57
Not Disclosed                                                         49
                                                                    ... 
Bolivia                                                                1
Central America and Africa                                             1
Guatemala and the Yirgacheffe growing region of southern Ethiopia      1
Latin America and Indonesia                                            1
Indonesia, Central and South America                                   1
Name: coffee_origin, Length: 454, dtype: int64

In [1343]:
#x2 value counts
older_df['x2'].value_counts()

Roast Level:    1309
Agtron:          479
Name: x2, dtype: int64

In [1344]:
#roast level value counts
#note some of these jump to agtron values
older_df['roast_level'].value_counts()

Medium          371
Medium-Dark     331
/               211
Medium-Light    196
Very Dark       192
               ... 
41/45             1
43/44             1
35/42             1
32/34             1
56/65             1
Name: roast_level, Length: 195, dtype: int64

In [1345]:
#x3 value counts
older_df['x3'].value_counts()

Agtron:         1309
Review Date:     479
Name: x3, dtype: int64

In [1346]:
#agtron value counts
#note some of these jump to dates
older_df['agtron'].value_counts()

0/0           40
March 2002    18
/             16
March 1999    15
June 1998     14
              ..
19/37          1
26/48          1
33/44          1
36/47          1
41/48          1
Name: agtron, Length: 708, dtype: int64

In [1347]:
#x4 value counts
older_df['x4'].value_counts()

Review Date:    1247
Aroma:           479
Est. Price:       62
Name: x4, dtype: int64

In [1348]:
#est_price value counts
#mix of scores, dates, cost
older_df['est_price'].value_counts()

7                   155
8                   109
6                    87
January 2009         37
5                    30
                   ... 
$11.95/16 ounces      1
December 1999         1
November 1999         1
4.0                   1
May 2002              1
Name: est_price, Length: 137, dtype: int64

In [1349]:
#review_date value counts
#mix of scores, dates
older_df['x1'].value_counts()

Coffee Origin:    1309
Roast Level:       479
Name: x1, dtype: int64

In [1350]:
#overall_score value counts
older_df['overall_score'].value_counts()

90    207
88    196
87    174
89    150
91    124
85    122
86    118
92    112
93     82
84     79
81     48
80     45
83     44
94     43
82     42
78     41
79     37
95     26
77     16
76     14
96     13
73     11
75      7
97      6
72      6
74      6
71      6
70      4
NR      4
65      2
68      2
60      1
Name: overall_score, dtype: int64

In [1351]:
#h1 value counts
older_df['h1'].value_counts()

Aroma:          1246
Acidity:         469
Review Date:      62
Body:             11
Name: h1, dtype: int64

In [1352]:
#s1 value counts
#mix of dates and scores?
older_df['s1'].value_counts()

8                738
7                391
9                182
6                155
5                106
                ... 
8.7                1
7.0                1
4.0                1
November 1998      1
March 1997         1
Name: s1, Length: 90, dtype: int64

In [1353]:
#h2 value counts
older_df['h2'].value_counts()

Acidity:    958
Body:       757
Aroma:       62
Flavor:      11
Name: h2, dtype: int64

In [1354]:
#s2 value counts
older_df['s2'].value_counts()

7                    502
8                    389
7\t\t\t\t\t\t\t      299
8\t\t\t\t\t\t\t      203
6\t\t\t\t\t\t\t      132
6                     66
5\t\t\t\t\t\t\t       45
9                     40
9\t\t\t\t\t\t\t       14
5                     13
na                     7
4\t\t\t\t\t\t\t        6
5.3\t\t\t\t\t\t\t      5
4                      5
5.7\t\t\t\t\t\t\t      5
5.8\t\t\t\t\t\t\t      4
6.4\t\t\t\t\t\t\t      4
7.0\t\t\t\t\t\t\t      4
5.2\t\t\t\t\t\t\t      3
5.1\t\t\t\t\t\t\t      3
5.6\t\t\t\t\t\t\t      3
5.5\t\t\t\t\t\t\t      3
6.0\t\t\t\t\t\t\t      3
5.9\t\t\t\t\t\t\t      3
10                     2
6.1\t\t\t\t\t\t\t      2
7.1\t\t\t\t\t\t\t      2
6.2\t\t\t\t\t\t\t      2
6.3\t\t\t\t\t\t\t      2
6.6\t\t\t\t\t\t\t      1
10\t\t\t\t\t\t\t       1
5.0\t\t\t\t\t\t\t      1
4.8\t\t\t\t\t\t\t      1
5.4\t\t\t\t\t\t\t      1
NR\t\t\t\t\t\t\t       1
3                      1
7.3\t\t\t\t\t\t\t      1
4.3\t\t\t\t\t\t\t      1
4.9\t\t\t\t\t\t\t      1
6.7\t\t\t\t\t\t\t      1


In [1355]:
#h3 value counts
older_df['h3'].value_counts()

Body:          967
Flavor:        757
Acidity:        53
With Milk:       9
Aftertaste:      2
Name: h3, dtype: int64

In [1356]:
#s3 value counts
older_df['s3'].value_counts()

8\t\t\t\t\t\t\t      466
7\t\t\t\t\t\t\t      404
7                    275
8                    260
6                    102
5                     70
6\t\t\t\t\t\t\t       62
9                     27
9\t\t\t\t\t\t\t       26
4                     17
5\t\t\t\t\t\t\t        7
5.3                    6
5.7                    5
Flavor in milk: 5      4
3                      4
5.8                    3
6.5                    3
6.1                    3
6.0                    3
7.4                    3
6.7                    3
6.9                    3
3.7                    2
7.7                    2
7.3                    2
10                     2
5.6                    2
Flavor in milk: 6      2
4.4                    1
Flavor in milk: 7      1
8.1                    1
5.1                    1
5.5                    1
3.9                    1
6.2                    1
6.4                    1
4.9                    1
4.6                    1
5.0                    1
4.8                    1


### Sorting out columns with conflicting information

Goal: Sort overlapping column values into following colummns. Values found specifies in which scraped columns the data ended up and how they were paired (headers and values) to sort out the data.
 

- coffee_origin --> origin
    - values found in: 1) x1 and coffee_origin 
    - COMPLETE

- roast_level --> roast
    - values found in: 1) x1 and coffee_origin, 2) x2 and roast_level 
    - COMPLETE

- agtron: 
    - values found in: 1) x2 and roast_level, 2) x3 and agtron 
    - COMPLETE

- review data: 
    - values found in: 1) X3 and agtron, 2) x4 and est_price, 3) h1 and s1
    - complete

- est_price: 
    - 1)x4 and est_price
    - complete

- Aroma:
    - 1) x4 and est_price, 2) h1 and s1, 3) h2 and s2
    - complete

- Acidity:
    - 1) h1 and s1, 2) h2 and s2, 3) h3 and s3
    - complete

- Body:
    - 1) h1 and s1, 2) h2 and s2, 3) h3 and s3
    - complete
- Flavor:
    - 1) h2 and s2, 2) h3 and s3
    - done
- Aftertaste:
    - 1) h3 and s3
    - done
- With Milk:
    - 1) h3 and s3
    - done

Extracting origin and roast values their own columns:

In [1357]:
#creating placeholder for origin scores
older_df['origin_hold'] = older_df['x1'].map(str) + older_df['coffee_origin'].map(str)

In [1358]:
#creating placeholder for first roast_level scores
older_df['roast_hold1'] = older_df['x1'].map(str) + older_df['coffee_origin'].map(str)

In [1359]:
#roast value counts
older_df['roast_hold1'].value_counts()

Roast Level:nan                                                                    222
Coffee Origin:Not disclosed.                                                       178
Roast Level:Medium-Dark                                                             99
Coffee Origin:Not disclosed                                                         78
Roast Level:Dark                                                                    57
                                                                                  ... 
Coffee Origin:Bolivia                                                                1
Coffee Origin:Central America and Africa                                             1
Coffee Origin:Guatemala and the Yirgacheffe growing region of southern Ethiopia      1
Coffee Origin:Latin America and Indonesia                                            1
Coffee Origin:Indonesia, Central and South America                                   1
Name: roast_hold1, Length: 455, dtype: int6

In [1360]:
#getting rid of roast data in origin column
older_df['origin_hold'].replace(r'.*(oast).*', '', inplace = True, regex=True)

In [1361]:
#split origin so that we can put origin under correct header and drop unneeded data
older_df[['origin_hold','origin']] = older_df['origin_hold'].str.split(":", n=1, expand=True)

In [1362]:
#drop unneeded origin hold hold columns
older_df.drop(['origin_hold'], axis=1, inplace=True)

In [1363]:
older_df.head()

Unnamed: 0,coffee_name,roaster_name,overall_score,roaster_location,x1,coffee_origin,x2,roast_level,x3,agtron,...,s1,h2,s2,h3,s3,p1,p2,p3,roast_hold1,origin
0,Doi Chaang Wild Civet Coffee,Doi Chaang Coffee,90,"Calgary, Alberta, Canada",Coffee Origin:,Northern Thailand.,Roast Level:,Medium-Light,Agtron:,49/80,...,8,Acidity:,8,Body:,7\t\t\t\t\t\t\t,Blind Assessment: Intriguing mid-tones through...,Notes: Doi Chaang is a single-estate coffee pr...,Who Should Drink It: Culinary adventurers who ...,Coffee Origin:Northern Thailand.,Northern Thailand.
1,Kenya AA Lenana,Après Coffee,92,"Lancaster, Pennsylvania",Coffee Origin:,South-central Kenya,Roast Level:,Medium-Dark,Agtron:,42/53,...,8,Acidity:,8,Body:,7\t\t\t\t\t\t\t,"Blind Assessment: Rich, very intense aroma: da...",Notes: Despite stresses brought on by social u...,Who Should Drink It: Those who prefer understa...,Coffee Origin:South-central Kenya,South-central Kenya
2,Mele 100% Kona Coffee,Hula Daddy,92,"Holualoa, Hawaii",Coffee Origin:,"Holualoa, North Kona growing district, Hawaii.",Roast Level:,Medium-Light,Agtron:,51/73,...,8,Acidity:,8,Body:,7\t\t\t\t\t\t\t,"Blind Assessment: An exciting, rather unusual ...",Notes: A blend of coffees processed by a varie...,Who Should Drink It: An exhilarating sensory r...,"Coffee Origin:Holualoa, North Kona growing dis...","Holualoa, North Kona growing district, Hawaii."
3,Kenya Peaberry Thika Gethumbwini,JBC Coffee Roasters,96,"Madison, Wisconsin",Coffee Origin:,South-central Kenya,Roast Level:,Light,Agtron:,57/90,...,8,Acidity:,9,Body:,8\t\t\t\t\t\t\t,"Blind Assessment: Clean, complex, impeccable. ...",Notes: Despite national coffee leadership mark...,Who Should Drink It: Strikingly complete expre...,Coffee Origin:South-central Kenya,South-central Kenya
4,Kenya Gititu Peaberry,Atomic Cafe Coffee Roasters,92,"Beverly, Massachusetts",Coffee Origin:,South-central Kenya,Roast Level:,Medium,Agtron:,44/63,...,8,Acidity:,8,Body:,8\t\t\t\t\t\t\t,"Blind Assessment: Rich, complex fruit aroma, i...",Notes: Despite stresses brought on by social u...,Who Should Drink It: Lovers of sweet fruit fla...,Coffee Origin:South-central Kenya,South-central Kenya


In [1364]:
#creating placeholder for 2nd roast scores
older_df['roast2_hold'] = older_df['x2'].map(str) + older_df['roast_level'].map(str)

In [1365]:
#getting rid of non-roast data
older_df['roast2_hold'].replace(r'.*(gtron).*', '', inplace = True, regex=True)

In [1366]:
#getting rid of non-roast data
older_df['roast_hold1'].replace(r'.*(rigin).*', '', inplace = True, regex=True)

In [1367]:
#creating placeholder for roast scores
older_df['roast_combo'] = older_df['roast2_hold'].map(str) + ' ' + older_df['roast_hold1'].map(str)

In [1368]:
#split roast values out from headers
older_df[['roast_combo','roast']] = older_df['roast_combo'].str.split(":", n=1, expand=True)

In [1369]:
#drop unneeded roast hold columns
older_df.drop(['roast_combo', 'roast2_hold', 'roast_hold1'], axis=1, inplace=True)

In [1370]:
#seeing some duplicate values, getting rid of spaces so they match up
older_df['roast'] = older_df['roast'].str.strip()

In [1371]:
older_df.head()

Unnamed: 0,coffee_name,roaster_name,overall_score,roaster_location,x1,coffee_origin,x2,roast_level,x3,agtron,...,s1,h2,s2,h3,s3,p1,p2,p3,origin,roast
0,Doi Chaang Wild Civet Coffee,Doi Chaang Coffee,90,"Calgary, Alberta, Canada",Coffee Origin:,Northern Thailand.,Roast Level:,Medium-Light,Agtron:,49/80,...,8,Acidity:,8,Body:,7\t\t\t\t\t\t\t,Blind Assessment: Intriguing mid-tones through...,Notes: Doi Chaang is a single-estate coffee pr...,Who Should Drink It: Culinary adventurers who ...,Northern Thailand.,Medium-Light
1,Kenya AA Lenana,Après Coffee,92,"Lancaster, Pennsylvania",Coffee Origin:,South-central Kenya,Roast Level:,Medium-Dark,Agtron:,42/53,...,8,Acidity:,8,Body:,7\t\t\t\t\t\t\t,"Blind Assessment: Rich, very intense aroma: da...",Notes: Despite stresses brought on by social u...,Who Should Drink It: Those who prefer understa...,South-central Kenya,Medium-Dark
2,Mele 100% Kona Coffee,Hula Daddy,92,"Holualoa, Hawaii",Coffee Origin:,"Holualoa, North Kona growing district, Hawaii.",Roast Level:,Medium-Light,Agtron:,51/73,...,8,Acidity:,8,Body:,7\t\t\t\t\t\t\t,"Blind Assessment: An exciting, rather unusual ...",Notes: A blend of coffees processed by a varie...,Who Should Drink It: An exhilarating sensory r...,"Holualoa, North Kona growing district, Hawaii.",Medium-Light
3,Kenya Peaberry Thika Gethumbwini,JBC Coffee Roasters,96,"Madison, Wisconsin",Coffee Origin:,South-central Kenya,Roast Level:,Light,Agtron:,57/90,...,8,Acidity:,9,Body:,8\t\t\t\t\t\t\t,"Blind Assessment: Clean, complex, impeccable. ...",Notes: Despite national coffee leadership mark...,Who Should Drink It: Strikingly complete expre...,South-central Kenya,Light
4,Kenya Gititu Peaberry,Atomic Cafe Coffee Roasters,92,"Beverly, Massachusetts",Coffee Origin:,South-central Kenya,Roast Level:,Medium,Agtron:,44/63,...,8,Acidity:,8,Body:,8\t\t\t\t\t\t\t,"Blind Assessment: Rich, complex fruit aroma, i...",Notes: Despite stresses brought on by social u...,Who Should Drink It: Lovers of sweet fruit fla...,South-central Kenya,Medium


Extracting agtron scores into its own column:

In [1372]:
#creating placeholder for 1st agtron scores
older_df['agtron1_hold'] = older_df['x2'].map(str) + older_df['roast_level'].map(str)

In [1373]:
#creating placeholder for 2nd agtron scores
older_df['agtron2_hold'] = older_df['x3'].map(str) + older_df['agtron'].map(str)

In [1374]:
#getting rid of roast data in agtron 1 column
older_df['agtron1_hold'].replace(r'.*(oast).*', '', inplace = True, regex=True)

In [1375]:
#getting rid of review data in agtron 2 column
older_df['agtron2_hold'].replace(r'.*(view).*', '', inplace = True, regex=True)

In [1376]:
#creating placeholder for agtron scores
older_df['agtron_combo'] = older_df['agtron2_hold'].map(str) + ' ' + older_df['agtron1_hold'].map(str)

In [1377]:
#splitting out agtron scores from headers
older_df[['agtron_combo','agtron_new']] = older_df['agtron_combo'].str.split(":", n=1, expand=True)

In [1378]:
#drop unneeded agtron hold columns
older_df.drop(['agtron_combo', 'agtron2_hold', 'agtron1_hold'], axis=1, inplace=True)

Extracting review info into its own column:

In [1379]:
#creating placeholder for 1st review scores
older_df['review1_hold'] = older_df['x3'].map(str) + older_df['agtron'].map(str)

In [1380]:
#creating placeholder for 2nd review scores
older_df['review2_hold'] = older_df['x4'].map(str) + older_df['est_price'].map(str)

In [1381]:
#creating placeholder for 3rd review scores
older_df['review3_hold'] = older_df['h1'].map(str) + older_df['s1'].map(str)

In [1382]:
#getting rid of agtron in review scores
older_df['review1_hold'].replace(r'.*(gtron).*', '', inplace = True, regex=True)

In [1383]:
#getting rid of price
older_df['review2_hold'].replace(r'.*(Price).*', '', inplace = True, regex=True)

In [1384]:
#getting rid of aroma
older_df['review2_hold'].replace(r'.*(roma).*', '', inplace = True, regex=True)

In [1385]:
#getting rid of aroma
older_df['review3_hold'].replace(r'.*(roma).*', '', inplace = True, regex=True)

In [1386]:
#getting rid of acidity
older_df['review3_hold'].replace(r'.*(cidity).*', '', inplace = True, regex=True)

In [1387]:
#getting rid of body
older_df['review3_hold'].replace(r'.*(ody).*', '', inplace = True, regex=True)

In [1388]:
#creating placeholder for review scores
older_df['review_combo'] = older_df['review1_hold'].map(str) + ' ' + older_df['review2_hold'].map(str) + ' ' + older_df['review3_hold'].map(str)

In [1389]:
#splitting out review scores
older_df[['review_combo','review']] = older_df['review_combo'].str.split(":", n=1, expand=True)

In [1390]:
#drop unneeded review hold columns
older_df.drop(['review_combo', 'review1_hold', 'review2_hold', 'review3_hold'], axis=1, inplace=True)

Extracting est_price info into its own column:

In [1391]:
#creating placeholder for 1st price scores
older_df['price1_hold'] = older_df['x4'].map(str) + older_df['est_price'].map(str)

In [1392]:
#getting rid of agtron in review scores
older_df['price1_hold'].replace(r'.*(eview).*', '', inplace = True, regex=True)

In [1393]:
#getting rid of agtron in review scores
older_df['price1_hold'].replace(r'.*(roma).*', '', inplace = True, regex=True)

In [1394]:
#splitting out price scores
older_df[['price1_hold','price']] = older_df['price1_hold'].str.split(":", n=1, expand=True)

In [1395]:
#drop unneeded price hold columns
older_df.drop(['price1_hold'], axis=1, inplace=True)

Extracting aroma values:

In [1396]:
#creating placeholder for 2nd review scores
older_df['aroma1_hold'] = older_df['x4'].map(str) + older_df['est_price'].map(str)

In [1397]:
#creating placeholder for 2nd review scores
older_df['aroma2_hold'] = older_df['h1'].map(str) + older_df['s1'].map(str)

In [1398]:
#creating placeholder for 2nd review scores
older_df['aroma3_hold'] = older_df['h2'].map(str) + older_df['s2'].map(str)

In [1399]:
#getting rid of review
older_df['aroma1_hold'].replace(r'.*(eview).*', '', inplace = True, regex=True)

In [1400]:
#getting rid of price
older_df['aroma1_hold'].replace(r'.*(rice).*', '', inplace = True, regex=True)

In [1401]:
#getting rid of review
older_df['aroma2_hold'].replace(r'.*(eview).*', '', inplace = True, regex=True)

In [1402]:
#getting rid of acidity
older_df['aroma2_hold'].replace(r'.*(cidity).*', '', inplace = True, regex=True)

In [1403]:
#getting rid of body
older_df['aroma2_hold'].replace(r'.*(ody).*', '', inplace = True, regex=True)

In [1404]:
#getting rid of acidity
older_df['aroma3_hold'].replace(r'.*(cidity).*', '', inplace = True, regex=True)

In [1405]:
#getting rid of body
older_df['aroma3_hold'].replace(r'.*(ody).*', '', inplace = True, regex=True)

In [1406]:
#getting rid of flavor
older_df['aroma3_hold'].replace(r'.*(lavor).*', '', inplace = True, regex=True)

In [1407]:
#creating placeholder for review scores
older_df['aroma_combo'] = older_df['aroma1_hold'].map(str) + older_df['aroma2_hold'].map(str) + older_df['aroma3_hold'].map(str)

In [1408]:
#extracting aroma ratings from combined column
older_df['aroma'] = older_df['aroma_combo'].str.extract(r'([0-9]+)')

In [1409]:
#drop unneeded aroma hold columns
older_df.drop(['aroma1_hold', 'aroma2_hold', 'aroma3_hold', 'aroma_combo'], axis=1, inplace=True)

In [1410]:
#updating cleaned score value data types to integer
older_df['aroma'] = older_df['aroma'].astype('Int8')

Extracting acidity into its own column:

In [1411]:
#creating placeholder for acidity scores
older_df['acidity1_hold'] = older_df['h1'].map(str) + older_df['s1'].map(str)

In [1412]:
#creating placeholder for 2nd acidity scores
older_df['acidity2_hold'] = older_df['h2'].map(str) + older_df['s2'].map(str)

In [1413]:
#creating placeholder for 3rd acidity scores
older_df['acidity3_hold'] = older_df['h3'].map(str) + older_df['s3'].map(str)

In [1414]:
#getting rid of aroma
older_df['acidity1_hold'].replace(r'.*(roma).*', '', inplace = True, regex=True)

In [1415]:
#getting rid of review
older_df['acidity1_hold'].replace(r'.*(eview).*', '', inplace = True, regex=True)

In [1416]:
#getting rid of body
older_df['acidity1_hold'].replace(r'.*(ody).*', '', inplace = True, regex=True)

In [1417]:
#getting rid of body
older_df['acidity2_hold'].replace(r'.*(ody).*', '', inplace = True, regex=True)

In [1418]:
#getting rid of aroma
older_df['acidity2_hold'].replace(r'.*(roma).*', '', inplace = True, regex=True)

In [1419]:
#getting rid of flavor
older_df['acidity2_hold'].replace(r'.*(lavor).*', '', inplace = True, regex=True)

In [1420]:
#getting rid of flavor
older_df['acidity3_hold'].replace(r'.*(lavor).*', '', inplace = True, regex=True)

In [1421]:
#getting rid of body
older_df['acidity3_hold'].replace(r'.*(ody).*', '', inplace = True, regex=True)

In [1422]:
#getting rid of with milk
older_df['acidity3_hold'].replace(r'.*(ilk).*', '', inplace = True, regex=True)

In [1423]:
#getting rid of aftertaste
older_df['acidity3_hold'].replace(r'.*(tast).*', '', inplace = True, regex=True)

In [1424]:
#creating placeholder for review scores
older_df['acidity_combo'] = older_df['acidity1_hold'].map(str) + older_df['acidity2_hold'].map(str) + older_df['acidity3_hold'].map(str)

In [1425]:
#extracting aroma ratings from combined column
older_df['acidity'] = older_df['acidity_combo'].str.extract(r'([0-9]+)')

In [1426]:
#drop unneeded price hold columns
older_df.drop(['acidity1_hold', 'acidity2_hold', 'acidity3_hold', 'acidity_combo'], axis=1, inplace=True)

In [1427]:
#updating cleaned score value data types to integer
older_df['acidity'] = older_df['acidity'].astype('Int8')

Extracting body into its own column:

In [1428]:
#creating placeholder for acidity scores
older_df['body1_hold'] = older_df['h1'].map(str) + older_df['s1'].map(str)

In [1429]:
#creating placeholder for acidity scores
older_df['body2_hold'] = older_df['h2'].map(str) + older_df['s2'].map(str)

In [1430]:
#creating placeholder for acidity scores
older_df['body3_hold'] = older_df['h3'].map(str) + older_df['s3'].map(str)

In [1431]:
#getting rid of aroma
older_df['body1_hold'].replace(r'.*(roma).*', '', inplace = True, regex=True)

In [1432]:
#getting rid of review
older_df['body1_hold'].replace(r'.*(eview).*', '', inplace = True, regex=True)

In [1433]:
#getting rid of acidity
older_df['body1_hold'].replace(r'.*(cidity).*', '', inplace = True, regex=True)

In [1434]:
#getting rid of acidity
older_df['body2_hold'].replace(r'.*(cidity).*', '', inplace = True, regex=True)

In [1435]:
#getting rid of acidity
older_df['body2_hold'].replace(r'.*(roma).*', '', inplace = True, regex=True)

In [1436]:
#getting rid of acidity
older_df['body2_hold'].replace(r'.*(lavor).*', '', inplace = True, regex=True)

In [1437]:
#getting rid of acidity
older_df['body3_hold'].replace(r'.*(lavor).*', '', inplace = True, regex=True)

In [1438]:
#getting rid of acidity
older_df['body3_hold'].replace(r'.*(cidity).*', '', inplace = True, regex=True)

In [1439]:
#getting rid of acidity
older_df['body3_hold'].replace(r'.*(ilk).*', '', inplace = True, regex=True)

In [1440]:
#creating placeholder for review scores
older_df['body_combo'] = older_df['body1_hold'].map(str) + older_df['body2_hold'].map(str) + older_df['body3_hold'].map(str)

In [1441]:
#extracting body ratings from combined column

older_df['body'] = older_df['body_combo'].str.extract(r'([0-9]+)')

In [1442]:
#drop unneeded price hold columns
older_df.drop(['body1_hold', 'body2_hold', 'body3_hold', 'body_combo'], axis=1, inplace=True)

In [1443]:
#updating cleaned score value data types to integer
older_df['body'] = older_df['body'].astype('Int8')

Extracting flavor into its own column:

In [1444]:
#creating placeholder for flavor scores
older_df['flavor1_hold'] = older_df['h2'].map(str) + older_df['s2'].map(str)

#creating placeholder for 2nd flavor scores
older_df['flavor2_hold'] = older_df['h3'].map(str) + older_df['s3'].map(str)

In [1445]:
#getting rid of aroma
older_df['flavor1_hold'].replace(r'.*(roma).*', '', inplace = True, regex=True)

In [1446]:
#getting rid of acidity
older_df['flavor1_hold'].replace(r'.*(cidity).*', '', inplace = True, regex=True)

In [1447]:
#getting rid of body
older_df['flavor1_hold'].replace(r'.*(ody).*', '', inplace = True, regex=True)

In [1448]:
#getting rid of aroma
older_df['flavor2_hold'].replace(r'.*(ody).*', '', inplace = True, regex=True)

In [1449]:
#getting rid of acidity
older_df['flavor2_hold'].replace(r'.*(cidity).*', '', inplace = True, regex=True)

In [1450]:
#getting rid of aroma
older_df['flavor2_hold'].replace(r'.*(tast).*', '', inplace = True, regex=True)

In [1451]:
#getting rid of aroma
older_df['flavor2_hold'].replace(r'.*(ilk).*', '', inplace = True, regex=True)

In [1452]:
#creating placeholder for agtron scores
older_df['flavor_combo'] = older_df['flavor1_hold'].map(str) + older_df['flavor2_hold'].map(str)

In [1453]:
#extracting body ratings from combined column

older_df['flavor'] = older_df['flavor_combo'].str.extract(r'([0-9]+)')

In [1454]:
#drop unneeded price hold columns
older_df.drop(['flavor1_hold', 'flavor2_hold', 'flavor_combo'], axis=1, inplace=True)

In [1455]:
#updating cleaned score value data types to integer
older_df['flavor'] = older_df['flavor'].astype('Int8')

Extracting aftertaste into its own column:

In [1456]:
#creating placeholder for flavor scores
older_df['after1_hold'] = older_df['h3'].map(str) + older_df['s3'].map(str)

In [1457]:
#getting rid of body
older_df['after1_hold'].replace(r'.*(ody).*', '', inplace = True, regex=True)

In [1458]:
#getting rid of acidity
older_df['after1_hold'].replace(r'.*(cidity).*', '', inplace = True, regex=True)

In [1459]:
#getting rid of flavor
older_df['after1_hold'].replace(r'.*(lavor).*', '', inplace = True, regex=True)

In [1460]:
#getting rid of with milk
older_df['after1_hold'].replace(r'.*(ilk).*', '', inplace = True, regex=True)

In [1461]:
#extracting body ratings from combined column

older_df['aftertaste'] = older_df['after1_hold'].str.extract(r'([0-9]+)')

In [1462]:
#drop unneeded price hold columns
older_df.drop(['after1_hold'], axis=1, inplace=True)

In [1463]:
#updating cleaned score value data types to integer
older_df['aftertaste'] = older_df['aftertaste'].astype('Int8')

Extracting with milk into its own column:

In [1464]:
#creating placeholder for flavor scores
older_df['milk1_hold'] = older_df['h3'].map(str) + older_df['s3'].map(str)

In [1465]:
#getting rid of body
older_df['milk1_hold'].replace(r'.*(ody).*', '', inplace = True, regex=True)

In [1466]:
#getting rid of acidity
older_df['milk1_hold'].replace(r'.*(cidity).*', '', inplace = True, regex=True)

In [1467]:
#getting rid of body
older_df['milk1_hold'].replace(r'.*(tast).*', '', inplace = True, regex=True)

In [1468]:
#getting rid of body
older_df['milk1_hold'].replace(r'(With Milk:Flavor in milk: 5).*', '5', inplace = True, regex=True)

In [1469]:
#getting rid of body
older_df['milk1_hold'].replace(r'(With Milk:Flavor in milk: 6).*', '6', inplace = True, regex=True)

In [1470]:
#getting rid of body
older_df['milk1_hold'].replace(r'(With Milk:Flavor in milk: 7).*', '7', inplace = True, regex=True)

In [1471]:
#getting rid of body
older_df['milk1_hold'].replace(r'.*(lavor).*', '', inplace = True, regex=True)

In [1472]:
#extracting body ratings from combined column

older_df['with_milk'] = older_df['milk1_hold'].str.extract(r'([0-9]+)')

In [1473]:
older_df['with_milk'].unique()

array([nan, '7', '8', '6', '5'], dtype=object)

In [1474]:
#drop unneeded price hold columns
older_df.drop(['milk1_hold'], axis=1, inplace=True)

In [1475]:
#updating cleaned score value data types to integer
older_df['with_milk'] = older_df['with_milk'].astype('Int8')

Dropping columns not needed now that data is extracted:

In [1476]:
#drop unneeded price hold columns
older_df.drop(['x1', 'coffee_origin', 'x2', 'roast_level', 'x3', 'agtron', 'x4', 'est_price', 'h1', 's1', 'h2', 's2', 'h3', 's3'], axis=1, inplace=True)

In [1477]:
older_df.to_csv('clean_older_df.csv', index=False)

### Splitting agtron into two columns, one for whole bean score and one for ground score

- The first number refers to the bean agtron score
- The second number refers to the ground agtron score

In [1478]:
#split agtron into 2 columns
older_df[['bean_agtron','ground_agtron']] = older_df['agtron_new'].str.split("/",expand=True)

In [1479]:
#verify split
older_df.head()

Unnamed: 0,coffee_name,roaster_name,overall_score,roaster_location,p1,p2,p3,origin,roast,agtron_new,review,price,aroma,acidity,body,flavor,aftertaste,with_milk,bean_agtron,ground_agtron
0,Doi Chaang Wild Civet Coffee,Doi Chaang Coffee,90,"Calgary, Alberta, Canada",Blind Assessment: Intriguing mid-tones through...,Notes: Doi Chaang is a single-estate coffee pr...,Who Should Drink It: Culinary adventurers who ...,Northern Thailand.,Medium-Light,49/80,June 2009,,8,8,7,,,,49,80
1,Kenya AA Lenana,Après Coffee,92,"Lancaster, Pennsylvania","Blind Assessment: Rich, very intense aroma: da...",Notes: Despite stresses brought on by social u...,Who Should Drink It: Those who prefer understa...,South-central Kenya,Medium-Dark,42/53,June 2009,,8,8,7,,,,42,53
2,Mele 100% Kona Coffee,Hula Daddy,92,"Holualoa, Hawaii","Blind Assessment: An exciting, rather unusual ...",Notes: A blend of coffees processed by a varie...,Who Should Drink It: An exhilarating sensory r...,"Holualoa, North Kona growing district, Hawaii.",Medium-Light,51/73,June 2009,,8,8,7,,,,51,73
3,Kenya Peaberry Thika Gethumbwini,JBC Coffee Roasters,96,"Madison, Wisconsin","Blind Assessment: Clean, complex, impeccable. ...",Notes: Despite national coffee leadership mark...,Who Should Drink It: Strikingly complete expre...,South-central Kenya,Light,57/90,June 2009,,8,9,8,,,,57,90
4,Kenya Gititu Peaberry,Atomic Cafe Coffee Roasters,92,"Beverly, Massachusetts","Blind Assessment: Rich, complex fruit aroma, i...",Notes: Despite stresses brought on by social u...,Who Should Drink It: Lovers of sweet fruit fla...,South-central Kenya,Medium,44/63,June 2009,,8,8,8,,,,44,63


In [1480]:
#drop unneeded agtron column
older_df.drop('agtron_new', axis=1, inplace=True)

In [1481]:
#checking drop
older_df.head()

Unnamed: 0,coffee_name,roaster_name,overall_score,roaster_location,p1,p2,p3,origin,roast,review,price,aroma,acidity,body,flavor,aftertaste,with_milk,bean_agtron,ground_agtron
0,Doi Chaang Wild Civet Coffee,Doi Chaang Coffee,90,"Calgary, Alberta, Canada",Blind Assessment: Intriguing mid-tones through...,Notes: Doi Chaang is a single-estate coffee pr...,Who Should Drink It: Culinary adventurers who ...,Northern Thailand.,Medium-Light,June 2009,,8,8,7,,,,49,80
1,Kenya AA Lenana,Après Coffee,92,"Lancaster, Pennsylvania","Blind Assessment: Rich, very intense aroma: da...",Notes: Despite stresses brought on by social u...,Who Should Drink It: Those who prefer understa...,South-central Kenya,Medium-Dark,June 2009,,8,8,7,,,,42,53
2,Mele 100% Kona Coffee,Hula Daddy,92,"Holualoa, Hawaii","Blind Assessment: An exciting, rather unusual ...",Notes: A blend of coffees processed by a varie...,Who Should Drink It: An exhilarating sensory r...,"Holualoa, North Kona growing district, Hawaii.",Medium-Light,June 2009,,8,8,7,,,,51,73
3,Kenya Peaberry Thika Gethumbwini,JBC Coffee Roasters,96,"Madison, Wisconsin","Blind Assessment: Clean, complex, impeccable. ...",Notes: Despite national coffee leadership mark...,Who Should Drink It: Strikingly complete expre...,South-central Kenya,Light,June 2009,,8,9,8,,,,57,90
4,Kenya Gititu Peaberry,Atomic Cafe Coffee Roasters,92,"Beverly, Massachusetts","Blind Assessment: Rich, complex fruit aroma, i...",Notes: Despite stresses brought on by social u...,Who Should Drink It: Lovers of sweet fruit fla...,South-central Kenya,Medium,June 2009,,8,8,8,,,,44,63


In [1482]:
#updating cleaned score value data types to integer
older_df['aroma'].value_counts()

8     831
7     467
9     226
6     153
5      68
4      22
10     15
3       3
Name: aroma, dtype: Int64

### Turning new agtron columns into numerical values

Cleaning up bean_agtron to integer type:

In [1483]:
#checking values to see which need updated to allow switch to integer
older_df['bean_agtron'].unique()

array(['49', '42', '51', '57', '44', '55', '53', '50', '41', '47', '46',
       '43', '58', '23', '37', '26', '32', '39', '0', '65', '54', '31',
       '52', '45', '48', '28', '59', '40', '63', '38', '27', '18', '36',
       '60', '69', '29', '75', '56', '62', '71', '76', '66', '64', '61',
       '20', '35', '21', '25', '34', '70', '33', '30', '22', '19', '24',
       '15', '14', '', '68', '67', 'g', '77', '17', '11', '43.5'],
      dtype=object)

In [1484]:
#turning blank into nans
older_df['bean_agtron'].replace('', np.nan, inplace=True)

In [1485]:
#turning 'g' into nans
older_df['bean_agtron'].replace('g', np.nan, inplace=True)

In [1486]:
#converting to float
older_df['bean_agtron'] = older_df['bean_agtron'].astype('float')

In [1487]:
#round numbers
older_df['bean_agtron'] = older_df['bean_agtron'].round()

In [1488]:
#converting to integer
older_df['bean_agtron'] = older_df['bean_agtron'].astype('Int8')

Cleaning up ground_agtron to integer type:

In [1489]:
#checking values to see which need updated to allow switch to integer
older_df['ground_agtron'].value_counts()

       211
66      45
63      44
0       41
58      41
      ... 
wb       1
81       1
25       1
28       1
68       1
Name: ground_agtron, Length: 134, dtype: int64

In [1490]:
#extracting body ratings from combined column

older_df['ground_agtron'] = older_df['ground_agtron'].str.extract(r'([0-9]+)')

In [1491]:
#converting to integer
older_df['ground_agtron'] = older_df['ground_agtron'].astype('Int8')

In [1492]:
#verifying data type updates
older_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1788 entries, 0 to 1787
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   coffee_name       1788 non-null   object
 1   roaster_name      1788 non-null   object
 2   overall_score     1788 non-null   object
 3   roaster_location  1786 non-null   object
 4   p1                1788 non-null   object
 5   p2                1788 non-null   object
 6   p3                1788 non-null   object
 7   origin            1309 non-null   object
 8   roast             1788 non-null   object
 9   review            1788 non-null   object
 10  price             62 non-null     object
 11  aroma             1785 non-null   Int8  
 12  acidity           1452 non-null   Int8  
 13  body              1733 non-null   Int8  
 14  flavor            767 non-null    Int8  
 15  aftertaste        2 non-null      Int8  
 16  with_milk         9 non-null      Int8  
 17  bean_agtron   

### Turn overall score into integer

In [1493]:
#getting rid of aroma
older_df['overall_score'].replace(r'(NR)', '', inplace = True, regex=True)

In [1494]:
#turning blank into nans
older_df['overall_score'].replace('', np.nan, inplace=True)

In [1495]:
#converting to integer
older_df['overall_score'] = older_df['overall_score'].astype('Int8')

### Turn Roast_Level to Ordinal Values

Turning roast level to integers, ranging from 1-6, from Light to Very Dark.

In [1496]:
#checking value options
older_df['roast'].value_counts()

Medium-Dark     430
Medium          420
nan             278
Very Dark       234
Medium-Light    205
Dark            174
Light            47
Name: roast, dtype: int64

In [1497]:
#mapping values to numbers
roast_mapper = {'Light':1, 'Medium-Light':2, 'Medium':3, 'Medium-Dark':4, 'Dark':5, 'Very Dark':6}

In [1498]:
#updating values
older_df['roast'] = older_df['roast'].replace(roast_mapper)

In [1499]:
#turning blank into nans
older_df['roast'].replace('', np.nan, inplace=True)

In [1500]:
#changing to integers
older_df['roast'] = older_df['roast'].astype('float')

In [1501]:
#changing to integers
older_df['roast'] = older_df['roast'].astype('Int8')

In [1502]:
#checking update
older_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1788 entries, 0 to 1787
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   coffee_name       1788 non-null   object
 1   roaster_name      1788 non-null   object
 2   overall_score     1784 non-null   Int8  
 3   roaster_location  1786 non-null   object
 4   p1                1788 non-null   object
 5   p2                1788 non-null   object
 6   p3                1788 non-null   object
 7   origin            1309 non-null   object
 8   roast             1510 non-null   Int8  
 9   review            1788 non-null   object
 10  price             62 non-null     object
 11  aroma             1785 non-null   Int8  
 12  acidity           1452 non-null   Int8  
 13  body              1733 non-null   Int8  
 14  flavor            767 non-null    Int8  
 15  aftertaste        2 non-null      Int8  
 16  with_milk         9 non-null      Int8  
 17  bean_agtron   

### Addressing the Estimated Price Column

Background: Estimated price data uses a number of different measures (ounces, grams, bottles, etc). Original plan was to break it apart with goal of converting to same measure so that they can be compared in a meaningful way. This would involve modifying data to get a price per ounce. In addition, the reviews span from 1997 to 2022 and it is unclear if the pricing data gets updated.

Given how complex it is to turn this column into usable data, the column is being dropped for now. 

In [1503]:
#drop est_price column 
older_df.drop('price', axis=1, inplace=True)

In [1504]:
older_df.head()

Unnamed: 0,coffee_name,roaster_name,overall_score,roaster_location,p1,p2,p3,origin,roast,review,aroma,acidity,body,flavor,aftertaste,with_milk,bean_agtron,ground_agtron
0,Doi Chaang Wild Civet Coffee,Doi Chaang Coffee,90,"Calgary, Alberta, Canada",Blind Assessment: Intriguing mid-tones through...,Notes: Doi Chaang is a single-estate coffee pr...,Who Should Drink It: Culinary adventurers who ...,Northern Thailand.,2,June 2009,8,8,7,,,,49,80
1,Kenya AA Lenana,Après Coffee,92,"Lancaster, Pennsylvania","Blind Assessment: Rich, very intense aroma: da...",Notes: Despite stresses brought on by social u...,Who Should Drink It: Those who prefer understa...,South-central Kenya,4,June 2009,8,8,7,,,,42,53
2,Mele 100% Kona Coffee,Hula Daddy,92,"Holualoa, Hawaii","Blind Assessment: An exciting, rather unusual ...",Notes: A blend of coffees processed by a varie...,Who Should Drink It: An exhilarating sensory r...,"Holualoa, North Kona growing district, Hawaii.",2,June 2009,8,8,7,,,,51,73
3,Kenya Peaberry Thika Gethumbwini,JBC Coffee Roasters,96,"Madison, Wisconsin","Blind Assessment: Clean, complex, impeccable. ...",Notes: Despite national coffee leadership mark...,Who Should Drink It: Strikingly complete expre...,South-central Kenya,1,June 2009,8,9,8,,,,57,90
4,Kenya Gititu Peaberry,Atomic Cafe Coffee Roasters,92,"Beverly, Massachusetts","Blind Assessment: Rich, complex fruit aroma, i...",Notes: Despite stresses brought on by social u...,Who Should Drink It: Lovers of sweet fruit fla...,South-central Kenya,3,June 2009,8,8,8,,,,44,63


### Turning `review_date` into datetime type and extracting month and year.

In [1505]:
#turning into data time format
older_df['review'] = pd.to_datetime(older_df['review'])

In [1506]:
#extracting year
older_df['year'] = pd.DatetimeIndex(older_df['review']).year

In [1507]:
#extracting month
older_df['month'] = pd.DatetimeIndex(older_df['review']).month

In [1508]:
#dropping unneeded original datetime column
older_df.drop('review', axis=1, inplace=True)

In [1511]:
older_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1788 entries, 0 to 1787
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   coffee_name       1788 non-null   object
 1   roaster_name      1788 non-null   object
 2   overall_score     1784 non-null   Int8  
 3   roaster_location  1786 non-null   object
 4   p1                1788 non-null   object
 5   p2                1788 non-null   object
 6   p3                1788 non-null   object
 7   origin            1309 non-null   object
 8   roast             1510 non-null   Int8  
 9   aroma             1785 non-null   Int8  
 10  acidity           1452 non-null   Int8  
 11  body              1733 non-null   Int8  
 12  flavor            767 non-null    Int8  
 13  aftertaste        2 non-null      Int8  
 14  with_milk         9 non-null      Int8  
 15  bean_agtron       1549 non-null   Int8  
 16  ground_agtron     1554 non-null   Int8  
 17  year          

### Turn locations into latitude and longitude

In [1518]:
older_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1788 entries, 0 to 1787
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   coffee_name       1788 non-null   object
 1   roaster_name      1788 non-null   object
 2   overall_score     1784 non-null   Int8  
 3   roaster_location  1786 non-null   object
 4   p1                1788 non-null   object
 5   p2                1788 non-null   object
 6   p3                1788 non-null   object
 7   origin            1309 non-null   object
 8   roast             1510 non-null   Int8  
 9   aroma             1785 non-null   Int8  
 10  acidity           1452 non-null   Int8  
 11  body              1733 non-null   Int8  
 12  flavor            767 non-null    Int8  
 13  aftertaste        2 non-null      Int8  
 14  with_milk         9 non-null      Int8  
 15  bean_agtron       1549 non-null   Int8  
 16  ground_agtron     1554 non-null   Int8  
 17  year          

In [1519]:
#hashed out so don't rerun on accident


#using open street map to get lat and long for roaster_location

#import requests
#import urllib.parse

#lat = []
#lon = []

#address_list = older_df['roaster_location'].to_numpy()

#for address in address_list:
    #url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(str(address)) +'?format=json'

    #try:
        #response = requests.get(url).json()
        #lat.append(response[0]["lat"])
        #lon.append(response[0]["lon"])
   # except IndexError:
        #lat.append('')
        #lon.append('')

#print(lat)
#print(lon)

['51.0460954', '40.03813', '19.6238149', '43.074761', '42.5584284', '35.996653', '43.5569174', '33.6170092', '45.0165728', '39.100105', '49.2433804', '42.4850931', '44.9772995', '42.2681569', '35.996653', '49.2433804', '38.6319657', '22.6203348', '32.8401623', '43.074761', '37.7840208', '35.3187279', '44.337125349999994', '34.007135', '45.0165728', '48.202158', '47.0451022', '42.2681569', '42.2417669', '37.3893889', '37.2333253', '33.7489924', '38.6319657', '34.4458248', '47.6038321', '49.2373251', '13.7524938', '24.992517', '44.337125349999994', '42.3039199', '45.0165728', '34.1729044', '40.03813', '29.6519684', '48.1625483', '40.420418999999995', '45.5202471', '3.3254169', '35.996653', '42.5584284', '41.6214825', '44.337125349999994', '35.2272086', '39.049011', '39.049011', '37.3893889', '37.7790262', '37.8753497', '38.4404925', '', '37.2333253', '37.7790262', '37.7790262', '37.2333253', '37.8044557', '', '38.4404925', '45.0165728', '42.5584284', '44.337125349999994', '38.6319657', '

In [1520]:
#checking lengths to support adding into dataframe
len(lat)

1788

In [1521]:
len(lon)

1788

In [1522]:
len(older_df)

1788

In [1523]:
#adding column to dataframe
older_df['roaster_location_lat'] = lat

In [1524]:
#adding column to dataframe
older_df['roaster_location_lon'] = lon

In [1786]:
#drop original roaster_location being replaced with lat and lon
data_df.drop('roaster_location', axis=1, inplace=True)

In [1526]:
#spot checking accuracy of lat locations
older_df.sample(10)

Unnamed: 0,coffee_name,roaster_name,overall_score,roaster_location,p1,p2,p3,origin,roast,aroma,...,body,flavor,aftertaste,with_milk,bean_agtron,ground_agtron,year,month,roaster_location_lat,roaster_location_lon
1759,Maxwell House Master Blend,Kraft Foods,65,"Tarrytown, New York",Blind Assessment: Low-toned to the point of fl...,Notes: A cheap coffee which is dramatically in...,Who Should Drink It: Those for whom (literally...,,2.0,4,...,7,4.0,,,,63.0,1997,5,41.0762077,-73.8587461
1626,Bella Vista Tres Rios,Starbucks Coffee,87,"Seattle, Washington",Blind Assessment: Liveliest and most resonant ...,"Notes: Though dark-roasted, not overbearingly ...",Who Should Drink It: Those who need proof that...,,5.0,7,...,6,7.0,,,39.0,40.0,1998,6,47.6038321,-122.330062
1460,"Finca La Tacita, Estate Peaberry",Caravan Coffee,90,"Newberg, Oregon",Blind Assessment: Full yet majestically buoyan...,Notes: Full yet majestically buoyant. The arom...,Who Should Drink It: Perhaps what we all yearn...,,,8,...,8,9.0,,,,,1999,11,45.300347,-122.972751
39,Ethiopia Yirgacheffe Beloya Selection 8,Barrington Coffee Roasting Co.,95,"Lee, Massachusetts",Blind Assessment: Grandly fruity and sweetly f...,"Notes: A dry-processed or ""natural"" coffee, me...",Who Should Drink It: Admirers of fine ports an...,"Yirgacheffe growing region, Sidamo Province, s...",1.0,9,...,8,,,,65.0,91.0,2009,4,42.3039199,-73.2340974
377,Hacienda La Esmeralda,Willoughby's Coffee & Tea,94,"Branford, Connecticut","Blind Assessment: Sweet-toned, quietly intense...",Notes: Coffee from trees of the botanical vari...,Who Should Drink It: Those who can afford one ...,"Boquete growing region, western Panama",2.0,8,...,8,,,,54.0,76.0,2007,11,41.2795414,-72.8150989
1266,Capricorn Blend Decaffeinated,Capricorn Coffees,83,"San Francisco, California",Blind Assessment: Dark roast in the classic bi...,Notes: The coffees making up this blend were d...,Who Should Drink It: An agreeable dark roast f...,,,7,...,6,8.0,,,0.0,0.0,2002,7,37.7790262,-122.419906
1035,Organic Peruvian Rainforest,Caffe Ibis,85,"Logan, Utah","Blind Assessment: In the nose sweetly pungent,...",Notes: This coffee is grown on the eastern slo...,Who Should Drink It: Those socially and enviro...,Northern Peru,4.0,8,...,7,,,,44.0,50.0,2004,3,41.7313447,-111.834863
1039,Peruvian Fair-Trade Organic,Alterra Coffee Roasters,88,"Milwaukee, Wisconsin",Blind Assessment: In the nose intensely roasty...,Notes: Produced by the 150 farmers of the Coch...,Who Should Drink It: This hugely roasty but ro...,"Quillabamba region, south-central Peru.",5.0,8,...,7,8.0,,,39.0,42.0,2004,3,43.0349931,-87.922497
1410,Organic Timor,Uncommon Grounds,93,"Berkeley, California","Blind Assessment: The quintessentially smooth,...","Notes: The quintessentially smooth, deep, rich...",Who Should Drink It: Should be delicious brewe...,,,8,...,8,10.0,,,,,2000,4,37.8753497,-122.23963364918777
489,Ethiopia Biloya Special,Paradise Roasters,97,"Ramsey, Minnesota","Blind Assessment: Intense, sweet-toned, chocol...",Notes: Most southern Ethiopia coffee is prepar...,"Who Should Drink It: Lovers of big, complexly ...",Southern Ethiopia,2.0,9,...,9,,,,57.0,78.0,2007,5,45.0165728,-93.0949501


### Cleaning up country names to return more lat and long for coffee origin
There were lots of missing values when running with original data. Cleaning up origin names to improve return rate of lat and lon. Lat and long will be a bit less precise given replacing in many spots with country names.


In [1579]:
older_df['origin'].unique()

array(['Thailand', 'Kenya', 'Hawaii', 'Rwanda', 'Burundi', 'Tanzania',
       'Ethiopia', 'Colombia', 'Indonesia', 'Not disclosed.', 'Brazil',
       'El Salvador', 'Yemen', 'South America, East Africa', 'India',
       'Papua New Guinea', 'Panama', 'Guatemala',
       'Africa, Latin America.', 'Central America.',
       'Central and South America.', 'South and Central America.',
       'Honduras', 'Costa Rica', 'Mexico', 'Bolivia', 'Nicaragua',
       'Northern Rivers region, New South Wales, Australia', 'Zimbabwe',
       'Peru', 'Jamaica', 'Puerto Rico',
       'Northern Sumatra, probably Aceh Province', 'Latin America.',
       'Northern Sumatra.', 'Central America; South America.', 'Haiti.',
       'East Timor.', 'Central America and Africa', 'East Africa.',
       'Not disclosed', 'Northern Sumatra', None, 'Blend',
       '20% Kona; other blend components not disclosed.',
       'Central and South America', 'Not Disclosed',
       'East, Central, and southern Africa', 'Zambia',
 

In [1580]:
older_df['origin'].value_counts()

Not disclosed.                                     178
Ethiopia                                           112
Colombia                                           112
Indonesia                                           98
Not disclosed                                       78
                                                  ... 
20% Kona; other blend components not disclosed.      1
Latin America.                                       1
Central America; South America.                      1
Haiti.                                               1
Java                                                 1
Name: origin, Length: 69, dtype: int64

In [1534]:
older_df['origin'].replace(r'.*(Ethiopia).*', 'Ethiopia', inplace = True, regex=True)

In [1535]:
older_df['origin'].replace(r'.*(Panama).*', 'Panama', inplace = True, regex=True)

In [1536]:
older_df['origin'].replace(r'.*(El Salvador).*', 'El Salvador', inplace = True, regex=True)

In [1537]:
older_df['origin'].replace(r'.*(Kenya).*', 'Kenya', inplace = True, regex=True)

In [1538]:
older_df['origin'].replace(r'.*(Guatemala).*', 'Guatemala', inplace = True, regex=True)

In [1539]:
older_df['origin'].replace(r'.*(Colombia).*', 'Colombia', inplace = True, regex=True)

In [1540]:
older_df['origin'].replace(r'.*(Burundi).*', 'Burundi', inplace = True, regex=True)

In [1541]:
older_df['origin'].replace(r'.*(Costa R.ca).*', 'Costa Rica', inplace = True, regex=True)

In [1542]:
older_df['origin'].replace(r'.*(Honduras).*', 'Honduras', inplace = True, regex=True)

In [1543]:
older_df['origin'].replace(r'.*(Rwanda).*', 'Rwanda', inplace = True, regex=True)

In [1544]:
older_df['origin'].replace(r'.*(Hawaii).*', 'Hawaii', inplace = True, regex=True)

In [1545]:
older_df['origin'].replace(r'.*(Hawai).*', 'Hawaii', inplace = True, regex=True)

In [1546]:
older_df['origin'].replace(r'.*(Peru).*', 'Peru', inplace = True, regex=True)

In [1548]:
older_df['origin'].replace(r'.*(India).*', 'India', inplace = True, regex=True)

In [1549]:
older_df['origin'].replace(r'.*(Brazil).*', 'Brazil', inplace = True, regex=True)

In [1550]:
older_df['origin'].replace(r'.*(Ecuador).*', 'Ecuador', inplace = True, regex=True)

In [1806]:
older_df['origin'].replace(r'.*(Haiti).*', 'Haiti', inplace = True, regex=True)

In [1551]:
older_df['origin'].replace(r'.*(Puerto Rico).*', 'Puerto Rico', inplace = True, regex=True)

In [1552]:
older_df['origin'].replace(r'.*(Uganda).*', 'Uganda', inplace = True, regex=True)

In [1553]:
older_df['origin'].replace(r'.*(Nicaragua).*', 'Nicaragua', inplace = True, regex=True)

In [1554]:
older_df['origin'].replace(r'.*(Papua New Guinea).*', 'Papua New Guinea', inplace = True, regex=True)

In [1555]:
older_df['origin'].replace(r'.*(Thailand).*', 'Thailand', inplace = True, regex=True)

In [1556]:
older_df['origin'].replace(r'.*(Indonesia).*', 'Indonesia', inplace = True, regex=True)

In [1557]:
older_df['origin'].replace(r'.*(Yemen).*', 'Yemen', inplace = True, regex=True)

In [1558]:
older_df['origin'].replace(r'.*(Jamaica).*', 'Jamaica', inplace = True, regex=True)

In [1559]:
older_df['origin'].replace(r'.*(Tanzania).*', 'Tanzania', inplace = True, regex=True)

In [1560]:
older_df['origin'].replace(r'.*(Mexico).*', 'Mexico', inplace = True, regex=True)

In [1561]:
older_df['origin'].replace(r'.*(Central Zambia).*', 'Central Zambia', inplace = True, regex=True)

In [1562]:
older_df['origin'].replace(r'.*(Taiwan).*', 'Taiwan', inplace = True, regex=True)

In [1563]:
older_df['origin'].replace(r'.*(Democratic Republic of the Congo).*', 'Democratic Republic of the Congo', inplace = True, regex=True)

In [1564]:
older_df['origin'].replace(r'.*(Bolivia).*', 'Bolivia', inplace = True, regex=True)

In [1565]:
older_df['origin'].replace(r'.*(Dominican Republic).*', 'Dominican Republic', inplace = True, regex=True)

In [1566]:
older_df['origin'].replace(r'.*(China).*', 'China', inplace = True, regex=True)

In [1567]:
older_df['origin'].replace(r'.*(Zimbabwe).*', 'Zimbabwe', inplace = True, regex=True)

In [1568]:
older_df['origin'].replace(r'.*(Philippines).*', 'The Philippines', inplace = True, regex=True)

In [1569]:
older_df['origin'].replace(r'.*(Malawi).*', 'Malawi', inplace = True, regex=True)

In [1570]:
older_df['origin'].replace(r'.*(Vietnam).*', 'Vietnam', inplace = True, regex=True)

In [1571]:
older_df['origin'].replace(r'.*(Zambia).*', 'Zambia', inplace = True, regex=True)

In [1572]:
older_df['origin'].replace(r'.*(Malaysia).*', 'Malaysia', inplace = True, regex=True)

In [1573]:
older_df['origin'].replace(r'.*(Democratic Republic of Congo).*', 'Democratic Republic of the Congo', inplace = True, regex=True)

In [1574]:
older_df['origin'].replace(r'.*(Democratic Republic of Congo).*', 'Democratic Republic of the Congo', inplace = True, regex=True)

In [1575]:
older_df['origin'].replace(r'.*(Laos).*', 'Laos', inplace = True, regex=True)

In [1576]:
older_df['origin'].replace(r'.*(Myanmar).*', 'Myanmar', inplace = True, regex=True)

In [1577]:
older_df['origin'].replace(r'.*(DRC Congo).*', 'Democratic Republic of the Congo', inplace = True, regex=True)

In [1578]:
older_df['origin'].replace(r'.*(California).*', 'California', inplace = True, regex=True)

In [1583]:
#####hashed out so don't rerun on accident

#getting lat and lon for coffee_origin locations

lat_origin = []
lon_origin = []

address_list_origin = older_df['origin'].to_numpy()

for address in address_list_origin:
    url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(str(address)) +'?format=json'

    try:
        response = requests.get(url).json()
        lat_origin.append(response[0]["lat"])
        lon_origin.append(response[0]["lon"])
    except IndexError:
        lat_origin.append('')
        lon_origin.append('')

print(lat_origin)
print(lon_origin)

['14.8971921', '1.4419683', '19.593801499999998', '1.4419683', '1.4419683', '-1.9646631', '-1.9646631', '-3.426449', '-3.426449', '-1.9646631', '-6.5247123', '-1.9646631', '-1.9646631', '-1.9646631', '-3.426449', '-6.5247123', '-3.426449', '10.2116702', '4.099917', '-2.4833826', '10.2116702', '', '-1.9646631', '', '', '', '', '-10.3333333', '', '', '10.2116702', '', '13.8000382', '', '', '16.3471243', '', '10.2116702', '', '10.2116702', '22.3511148', '-5.6816069', '-5.6816069', '-5.6816069', '-5.6816069', '-5.6816069', '-5.6816069', '-5.6816069', '-5.6816069', '1.4419683', '10.2116702', '4.099917', '-10.3333333', '1.4419683', '8.559559', '', '-10.3333333', '', '10.2116702', '15.5855545', '13.8000382', '15.5855545', '-10.3333333', '10.2116702', '10.2116702', '8.559559', '10.2116702', '16.3471243', '10.2116702', '4.099917', '', '-1.9646631', '-2.4833826', '-30.29284845', '', '', '10.2116702', '21.417531150000002', '', '', '', '16.3471243', '-1.9646631', '1.4419683', '', '10.2116702', '-2

In [1584]:
#adding column to dataframe
older_df['origin_lat'] = lat_origin

In [1585]:
#adding column to dataframe
older_df['origin_lon'] = lon_origin

In [1588]:
#drop unneeded agtron column
older_df.drop('origin', axis=1, inplace=True)

In [1589]:
#verifying accuracy of column adds
older_df.sample(5)

Unnamed: 0,coffee_name,roaster_name,overall_score,roaster_location,p1,p2,p3,roast,aroma,acidity,...,aftertaste,with_milk,bean_agtron,ground_agtron,year,month,roaster_location_lat,roaster_location_lon,origin_lat,origin_lon
1474,Ethiopia Harrar,Alaska Coffee Roasting Company,86,"Fairbanks, Alaska","Blind Assessment: Smoky, resiny tones turn the...",Notes: Alaska may be the home to more fine cof...,"Who Should Drink It: An adventurer 's coffee, ...",,8,6,...,,,,,1999,9,64.837845,-147.716675,44.933143,7.540121
105,Breakfast Blend,Tully's Coffee,88,"Seattle, Washington","Blind Assessment: Round, fruit-toned aroma, wi...",Notes: Tully's is a large quality-oriented spe...,Who Should Drink It: A gently balanced but lus...,4.0,8,7,...,,,42.0,51.0,2009,1,47.6038321,-122.330062,,
114,Full City Roast (K-Cup),Tully's Coffee,87,"Seattle, Washington",Blind Assessment: (As brewed in a Keurig B60 s...,Notes: Keurig brewing devices were among the f...,Who Should Drink It: A balanced cup with a res...,6.0,8,7,...,,,0.0,33.0,2009,1,47.6038321,-122.330062,,
213,Brazil Daterra Sunrise,Crescent Moon Coffee & Tea,92,"Mickleton, New Jersey","Blind Assessment: Sweet-toned, deeply pungent ...",Notes: This coffee is Rainforest Alliance cert...,Who Should Drink It: Those who avoid intensity...,3.0,9,7,...,,,51.0,69.0,2008,9,39.7901134,-75.2376834,-10.3333333,-53.2
1649,Espresso Blend,Calistoga Roastery,71,"Calistoga, California",Blind Assessment: Virtually all sweetness has ...,Notes: Virtually all sweetness has been driven...,Who Should Drink It: At best this is a coffee ...,,4,5,...,,,,,1998,5,38.5787966,-122.579705,44.933143,7.540121


In [1592]:
#turning blank into nans
older_df['roaster_location_lat'].replace('', np.nan, inplace=True)

In [1594]:
older_df['roaster_location_lat'] = older_df['roaster_location_lat'].astype(float)

In [1596]:
#turning blank into nans
older_df['roaster_location_lon'].replace('', np.nan, inplace=True)

In [1597]:
older_df['roaster_location_lon'] = older_df['roaster_location_lon'].astype(float)

In [1598]:
#turning blank into nans
older_df['origin_lat'].replace('', np.nan, inplace=True)

In [1599]:
older_df['origin_lat'] = older_df['origin_lat'].astype(float)

In [1600]:
#turning blank into nans
older_df['origin_lon'].replace('', np.nan, inplace=True)

In [1601]:
older_df['origin_lon'] = older_df['origin_lon'].astype(float)

In [1604]:
older_df.head()

Unnamed: 0,coffee_name,roaster_name,overall_score,roaster_location,p1,p2,p3,roast,aroma,acidity,...,aftertaste,with_milk,bean_agtron,ground_agtron,year,month,roaster_location_lat,roaster_location_lon,origin_lat,origin_lon
0,Doi Chaang Wild Civet Coffee,Doi Chaang Coffee,90,"Calgary, Alberta, Canada",Blind Assessment: Intriguing mid-tones through...,Notes: Doi Chaang is a single-estate coffee pr...,Who Should Drink It: Culinary adventurers who ...,2,8,8,...,,,49,80,2009,6,51.046095,-114.065465,14.897192,100.83273
1,Kenya AA Lenana,Après Coffee,92,"Lancaster, Pennsylvania","Blind Assessment: Rich, very intense aroma: da...",Notes: Despite stresses brought on by social u...,Who Should Drink It: Those who prefer understa...,4,8,8,...,,,42,53,2009,6,40.03813,-76.305669,1.441968,38.431398
2,Mele 100% Kona Coffee,Hula Daddy,92,"Holualoa, Hawaii","Blind Assessment: An exciting, rather unusual ...",Notes: A blend of coffees processed by a varie...,Who Should Drink It: An exhilarating sensory r...,2,8,8,...,,,51,73,2009,6,19.623815,-155.953638,19.593801,-155.42837
3,Kenya Peaberry Thika Gethumbwini,JBC Coffee Roasters,96,"Madison, Wisconsin","Blind Assessment: Clean, complex, impeccable. ...",Notes: Despite national coffee leadership mark...,Who Should Drink It: Strikingly complete expre...,1,8,9,...,,,57,90,2009,6,43.074761,-89.383761,1.441968,38.431398
4,Kenya Gititu Peaberry,Atomic Cafe Coffee Roasters,92,"Beverly, Massachusetts","Blind Assessment: Rich, complex fruit aroma, i...",Notes: Despite stresses brought on by social u...,Who Should Drink It: Lovers of sweet fruit fla...,3,8,8,...,,,44,63,2009,6,42.558428,-70.880049,1.441968,38.431398


In [1852]:
#checking data type updates
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4779 entries, 0 to 4778
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   coffee_name           4779 non-null   object 
 1   roaster_name          4779 non-null   object 
 2   roast_level           4705 non-null   Int8   
 3   overall_score         4779 non-null   int64  
 4   h1                    4779 non-null   object 
 5   s1                    4779 non-null   int64  
 6   h2                    4779 non-null   object 
 7   s2                    4779 non-null   int64  
 8   h3                    4779 non-null   object 
 9   s3                    4779 non-null   int64  
 10  h4                    4779 non-null   object 
 11  s4                    4779 non-null   int64  
 12  h5                    4779 non-null   object 
 13  s5                    4779 non-null   int64  
 14  p1                    4779 non-null   object 
 15  p2                   

In [1606]:
older_df.to_csv('2nd_clean_data_df.csv', index=False)

In [None]:
### clean up notebook
### combine with other data, continue cleaning

In [None]:
##########
# To Do

# - run a model with this initial info, pre-tokenizing

# - see if currency column gets updated or if prices are old (1997) - if eventually we want to keep

# - address second set of data

# - decide what to do with coffee names and roaster names
# - text tokenizing for paragraphs - AFTER joining all data together
# - text tokenizing for names??? - AFTER joining all data together

###########