In [22]:
import pandas as pd
import numpy as np

# Data Wrangling

In [23]:
data = pd.read_csv(r"train.csv")

### Summary of the Beer Dataset

This dataset comes from a data science case study whose goal is to predict the overall rating of a beer using a machine learning model.

**Data:**

* Each row represents one review of a beer.

**Target Variable:**

* The target variable is `review/overall`, which is a numerical value between 1.0 and 5.0 representing the user's overall rating of the beer.

**Features:**

* The dataset contains various features that can potentially influence the overall rating:
    * Beer information:
        * `beer/name`: Name of the beer.
        * `beer/style`: Style of the beer (e.g., IPA, Stout).
        * `beer/ABV`: Alcohol content of the beer by volume.
        * `beer/beerId`: Unique identifier for the beer reviewed.
        * `beer/brewerId`: Unique identifier for the brewery.
    * Review information:
        * `review/appearance`: Rating of the beer's appearance (1.0 to 5.0).
        * `review/aroma`: Rating of the beer's aroma (1.0 to 5.0).
        * `review/palate`: Rating of the beer's palate (1.0 to 5.0).
        * `review/taste`: Rating of the beer's taste (1.0 to 5.0).
        * `review/text`: Textual description of the review.
        * `review/timeStruct`: Dictionary containing information about the review submission time.
        * `review/timeUnix`: Unix timestamp of the review submission.
    * User information:
        * `user/ageInSeconds`: Age of the user in seconds.
        * `user/birthdayRaw`: User's birthday as a raw date string (e.g., "Aug 10, 1976").
        * `user/birthdayUnix`: User's birthday as a Unix timestamp.
        * `user/gender`: Gender of the user (if provided).
        * `user/profileName`: Username of the reviewer (if provided).

### Issues With Dataset
1. Dirty Data
    - Missing values in the `["review/text", "user/ageInSeconds", "user/birthdayRaw", "user/birthdayUnix", "user/gender", "user/profileName"]` columns. - 'Completion Issue'
    - The dtype of `["beer/name", "beer/style", "user/gender"]` is object; it should be category. - 'Validity Issue'
    - The `user/birthdayRaw` column isn't in datetime format. - 'Validity Issue'
    - The columns `["beer/ABV", "review/appearance", "review/aroma", "review/overall", "review/palate", "review/taste"]` are float64 even though float16 is enough. - 'Validity Issue'
    - The `"user/birthdayRaw"` column holds the full date of birth. Only the year of birth is needed, because the day and month are unlikely to influence `"review/overall"`. - 'Validity Issue'
2. Messy Data
    - The `review/timeStruct` column packs `["min", "hour", "mday", "sec", "year", "wday", "mon", "isdst", "yday"]` together in a single value.
    - The `user/birthdayRaw` column should be split into separate columns `user/birthdayDay`, `user/birthdayMonth`, `user/birthdayYear`.
    - `"beer/name"` and `"beer/style"` should be expanded into one column per category.
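
The two dtype issues above can be tried out on a few rows before touching the real data. The sketch below uses a made-up `toy` frame (not rows from `train.csv`) to show the category conversion and to check what a float16 downcast preserves. The ratings survive exactly because they move in 0.5 steps, while ABV values such as 7.2 get rounded, so float16 for `beer/ABV` is a trade-off rather than a free saving.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for train.csv; the values are illustrative only.
toy = pd.DataFrame({
    "beer/style": ["American IPA", "Old Ale", "American IPA"],
    "beer/ABV": [7.2, 9.8, 4.75],
    "review/overall": [4.0, 3.5, 5.0],
})

# Repeated strings are stored once each when the column is categorical.
toy["beer/style"] = toy["beer/style"].astype("category")

# Ratings move in 0.5 steps, which float16 stores exactly.
ratings16 = toy["review/overall"].astype(np.float16)
ratings_exact = bool((ratings16.astype(np.float64) == toy["review/overall"]).all())

# ABV values such as 7.2 are rounded: float16 keeps about three significant digits.
abv16 = toy["beer/ABV"].astype(np.float16)
abv_exact = bool((abv16.astype(np.float64) == toy["beer/ABV"]).all())

print(toy.dtypes)
print("ratings exact:", ratings_exact, "| ABV exact:", abv_exact)
```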

##### Finding More Issues

In [24]:
data.head()

Unnamed: 0,index,beer/ABV,beer/beerId,beer/brewerId,beer/name,beer/style,review/appearance,review/aroma,review/overall,review/palate,review/taste,review/text,review/timeStruct,review/timeUnix,user/ageInSeconds,user/birthdayRaw,user/birthdayUnix,user/gender,user/profileName
0,40163,5.0,46634,14338,Chiostro,Herbed / Spiced Beer,4.0,4.0,4.0,4.0,4.0,Pours a clouded gold with a thin white head. N...,"{'min': 38, 'hour': 3, 'mday': 16, 'sec': 10, ...",1229398690,,,,,RblWthACoz
1,8135,11.0,3003,395,Bearded Pat's Barleywine,American Barleywine,4.0,3.5,3.5,3.5,3.0,12oz bottle into 8oz snifter.\t\tDeep ruby red...,"{'min': 38, 'hour': 23, 'mday': 8, 'sec': 58, ...",1218238738,,,,,BeerSox
2,10529,4.7,961,365,Naughty Nellie's Ale,American Pale Ale (APA),3.5,4.0,3.5,3.5,3.5,First enjoyed at the brewpub about 2 years ago...,"{'min': 7, 'hour': 18, 'mday': 26, 'sec': 2, '...",1101492422,,,,Male,mschofield
3,44610,4.4,429,1,Pilsner Urquell,Czech Pilsener,3.0,3.0,2.5,3.0,3.0,First thing I noticed after pouring from green...,"{'min': 7, 'hour': 1, 'mday': 20, 'sec': 5, 'y...",1308532025,1209827000.0,"Aug 10, 1976",208508400.0,Male,molegar76
4,37062,4.4,4904,1417,Black Sheep Ale (Special),English Pale Ale,4.0,3.0,3.0,3.5,2.5,A: pours an amber with a one finger head but o...,"{'min': 51, 'hour': 6, 'mday': 12, 'sec': 48, ...",1299912708,,,,,Brewbro000


In [25]:
data.tail()

Unnamed: 0,index,beer/ABV,beer/beerId,beer/brewerId,beer/name,beer/style,review/appearance,review/aroma,review/overall,review/palate,review/taste,review/text,review/timeStruct,review/timeUnix,user/ageInSeconds,user/birthdayRaw,user/birthdayUnix,user/gender,user/profileName
37495,35175,5.5,22450,3268,Blackberry Scottish-Style,Fruit / Vegetable Beer,4.0,3.5,3.5,3.5,3.5,12 oz brown longneck with no freshness dating....,"{'min': 56, 'hour': 23, 'mday': 10, 'sec': 1, ...",1207871761,,,,,Redrover
37496,23666,8.5,7463,1199,Founders Dirty Bastard,Scotch Ale / Wee Heavy,4.5,4.0,3.5,4.5,4.5,A - A bright red with a maroon-amber hue; mini...,"{'min': 45, 'hour': 5, 'mday': 10, 'sec': 14, ...",1263102314,,,,,jmerloni
37497,47720,4.75,1154,394,Stoudt's Fest,Märzen / Oktoberfest,4.0,3.5,4.0,4.5,4.0,Sampled on tap at Redbones.\t\tThis marzen sty...,"{'min': 3, 'hour': 1, 'mday': 25, 'sec': 36, '...",1067043816,,,,,UncleJimbo
37498,33233,11.2,19960,1199,Founders KBS (Kentucky Breakfast Stout),American Double / Imperial Stout,4.0,4.0,4.0,5.0,5.0,Pours a black body with a brown head that very...,"{'min': 52, 'hour': 19, 'mday': 29, 'sec': 33,...",1296330753,,,,,Stockfan42
37499,23758,8.5,7463,1199,Founders Dirty Bastard,Scotch Ale / Wee Heavy,4.0,4.0,4.0,4.5,4.0,"A nice sweet, malty beer...nothing complex, ju...","{'min': 40, 'hour': 18, 'mday': 4, 'sec': 28, ...",1252089628,,,,,JayQue


In [26]:
data.isnull().sum()

index                    0
beer/ABV                 0
beer/beerId              0
beer/brewerId            0
beer/name                0
beer/style               0
review/appearance        0
review/aroma             0
review/overall           0
review/palate            0
review/taste             0
review/text             10
review/timeStruct        0
review/timeUnix          0
user/ageInSeconds    29644
user/birthdayRaw     29644
user/birthdayUnix    29644
user/gender          22186
user/profileName         5
dtype: int64

In [27]:
data.duplicated().sum()

0

#### Sampling random rows 10 to 15 times to look for further issues, if any
#### No more issues found

In [28]:
data.sample(3)

Unnamed: 0,index,beer/ABV,beer/beerId,beer/brewerId,beer/name,beer/style,review/appearance,review/aroma,review/overall,review/palate,review/taste,review/text,review/timeStruct,review/timeUnix,user/ageInSeconds,user/birthdayRaw,user/birthdayUnix,user/gender,user/profileName
11148,31599,9.8,24905,1199,Founders Curmudgeon (Old Ale),Old Ale,4.0,4.5,4.0,4.0,4.0,A - Dark amber with light trails of carbonatio...,"{'min': 45, 'hour': 0, 'mday': 9, 'sec': 7, 'y...",1223513107,,,,,jeonseh
22959,13469,9.5,32780,12224,Aiko,Euro Strong Lager,3.5,3.0,3.0,3.0,3.0,Appearance: An amber hue with little to no hea...,"{'min': 59, 'hour': 1, 'mday': 12, 'sec': 26, ...",1229047166,1100873000.0,"Jan 23, 1980",317462400.0,Male,colts9016
19426,13351,9.5,40176,12224,Lobster Lovers Beer,Euro Strong Lager,2.0,4.0,3.0,2.0,3.0,A: Cloudy deep orange. Little head that audibl...,"{'min': 48, 'hour': 5, 'mday': 1, 'sec': 22, '...",1317448102,,,,,bluHatter


In [29]:
data.describe()

Unnamed: 0,index,beer/ABV,beer/beerId,beer/brewerId,review/appearance,review/aroma,review/overall,review/palate,review/taste,review/timeUnix,user/ageInSeconds,user/birthdayUnix
count,37500.0,37500.0,37500.0,37500.0,37500.0,37500.0,37500.0,37500.0,37500.0,37500.0,7856.0,7856.0
mean,24951.887573,7.403725,21861.152027,3036.59512,3.900053,3.87324,3.88944,3.854867,3.92244,1232794000.0,1176705000.0,241630300.0
std,14434.009669,2.318145,18923.130832,5123.084675,0.588778,0.680865,0.70045,0.668068,0.716504,71909550.0,337551400.0,337551400.0
min,0.0,0.1,175.0,1.0,0.0,1.0,0.0,1.0,1.0,926294400.0,703436600.0,-2208960000.0
25%,12422.5,5.4,5441.0,395.0,3.5,3.5,3.5,3.5,3.5,1189194000.0,979481000.0,143362800.0
50%,24942.5,6.9,17538.0,1199.0,4.0,4.0,4.0,4.0,4.0,1248150000.0,1100009000.0,318326400.0
75%,37416.75,9.4,34146.0,1315.0,4.5,4.5,4.5,4.5,4.5,1291330000.0,1274973000.0,438854400.0
max,49999.0,57.7,77207.0,27797.0,5.0,5.0,5.0,5.0,5.0,1326267000.0,3627295000.0,714898800.0
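
The `min` row above reports 0.0 for `review/appearance` and `review/overall`, which is below the 1.0 to 5.0 scale described in the summary, so it may be worth counting such rows. A minimal sketch on a made-up `toy` frame follows; in the notebook the same lines would run on `data` with all five rating columns.

```python
import pandas as pd

# Illustrative rows only; in the notebook this would run on `data`.
toy = pd.DataFrame({
    "review/appearance": [4.0, 0.0, 3.5],
    "review/overall": [4.0, 0.0, 5.0],
    "review/taste": [4.5, 3.0, 1.0],
})

rating_cols = ["review/appearance", "review/overall", "review/taste"]

# True where a rating falls outside the documented 1.0 to 5.0 scale.
out_of_range = ~toy[rating_cols].apply(lambda col: col.between(1.0, 5.0))

print(out_of_range.sum())              # offending values per column
print(toy[out_of_range.any(axis=1)])   # rows with at least one offending value
```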


In [30]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37500 entries, 0 to 37499
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              37500 non-null  int64  
 1   beer/ABV           37500 non-null  float64
 2   beer/beerId        37500 non-null  int64  
 3   beer/brewerId      37500 non-null  int64  
 4   beer/name          37500 non-null  object 
 5   beer/style         37500 non-null  object 
 6   review/appearance  37500 non-null  float64
 7   review/aroma       37500 non-null  float64
 8   review/overall     37500 non-null  float64
 9   review/palate      37500 non-null  float64
 10  review/taste       37500 non-null  float64
 11  review/text        37490 non-null  object 
 12  review/timeStruct  37500 non-null  object 
 13  review/timeUnix    37500 non-null  int64  
 14  user/ageInSeconds  7856 non-null   float64
 15  user/birthdayRaw   7856 non-null   object 
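 16  user/birthdayUnix  7856 non-null   float64
 17  user/gender        15314 non-null  object 
 18  user/profileName   37495 non-null  object 
dtypes: float64(8), int64(4), object(7)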

## Data Cleaning

In [31]:
beer = data.copy()
beer.sample(3)

Unnamed: 0,index,beer/ABV,beer/beerId,beer/brewerId,beer/name,beer/style,review/appearance,review/aroma,review/overall,review/palate,review/taste,review/text,review/timeStruct,review/timeUnix,user/ageInSeconds,user/birthdayRaw,user/birthdayUnix,user/gender,user/profileName
7907,24917,6.9,23474,1199,Founders Rübæus,Fruit / Vegetable Beer,4.0,4.0,4.5,3.0,4.0,"This brew poured almost flat, but tasting reve...","{'min': 32, 'hour': 16, 'mday': 21, 'sec': 59,...",1219336379,,,,,Mgrad92
11399,30027,7.2,5441,1199,Founders Centennial IPA,American IPA,4.0,4.5,4.0,4.0,4.5,Poured into a pint glass. No bottled on or bb ...,"{'min': 12, 'hour': 0, 'mday': 24, 'sec': 46, ...",1295827966,,,,,HOPPYKC
1610,12220,4.6,178,60,Strike Out Stout,English Stout,4.5,3.5,4.0,3.5,4.0,The appearance is jet black with a tall dark t...,"{'min': 52, 'hour': 1, 'mday': 2, 'sec': 50, '...",1122947570,,,,,Naes


### Define
* Unwanted Features
    - Drop the `"beer/beerId"` and `"index"` columns because they are identifiers that are not useful for prediction.
    - Drop the `"review/timeUnix"` column because `"review/timeStruct"` and `"review/timeUnix"` carry the same information.
    - Drop the `"user/birthdayUnix"` column because `"user/birthdayRaw"` and `"user/birthdayUnix"` carry the same information.
    - Drop the `"user/ageInSeconds"` column because it can be derived from `"user/birthdayRaw"`.
    - Drop the `"user/profileName"` column because it is not useful for prediction.
* Messy Data
    - Apply `One Hot Encoding` to `"beer/name"` and `"beer/style"`. -- in "`/FeatureEngineering.ipynb`"
    - Parse the dictionary stored in `"review/timeStruct"` and create one column per key. -- in "`/FeatureEngineering.ipynb`"
    - Convert `["min", "hour", "mday", "sec", "year", "wday", "mon", "isdst", "yday"]` to category dtype. -- in "`/FeatureEngineering_2.ipynb`"
* Completion Issue
    - Drop the rows where `"review/text"` is missing, because only about `0.027%` of the rows (10 of 37,500) are affected.
    - Fill missing values of `"user/birthdayRaw"` with 0.
    - Fill missing values of `"user/gender"` using model-based imputation with Multinomial Naive Bayes (the input columns are mostly textual, and Naive Bayes works well on text). -- in "`/FeatureEngineering.ipynb`"
    - After model-based imputation of `"user/gender"`, all predictions are the same (Male), which is biased towards one category because Male far outweighs Female in the data. So, fill the missing values with the string `"Random"` instead.
* Validity Issue
    - Extract the year from `"user/birthdayRaw"` and discard the original full date in order to avoid overfitting.
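
The `"review/timeStruct"` step is deferred to `FeatureEngineering.ipynb`, which is not shown here, so the following is only a sketch of one way it could be done. The values in `toy` are illustrative, and the names `parsed`, `time_cols`, and `expanded` are made up for this example. Because the stored text uses single quotes it is not valid JSON, so the sketch parses it with `ast.literal_eval` rather than `json.loads`.

```python
import ast
import pandas as pd

# Illustrative values in the same shape as "review/timeStruct".
toy = pd.DataFrame({
    "review/overall": [4.0, 3.5],
    "review/timeStruct": [
        "{'min': 38, 'hour': 3, 'mday': 16, 'sec': 10, 'year': 2008, "
        "'wday': 1, 'mon': 12, 'isdst': 0, 'yday': 351}",
        "{'min': 38, 'hour': 23, 'mday': 8, 'sec': 58, 'year': 2008, "
        "'wday': 4, 'mon': 8, 'isdst': 0, 'yday': 221}",
    ],
})

# literal_eval safely evaluates the Python-style dict text.
parsed = toy["review/timeStruct"].apply(ast.literal_eval)

# One column per dictionary key, aligned on the original row index.
time_cols = pd.DataFrame(parsed.tolist(), index=toy.index)

# Swap the packed column for the expanded ones.
expanded = toy.drop(columns="review/timeStruct").join(time_cols)
print(expanded)
```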

In [32]:
beer.drop(columns=["index", "beer/beerId", "review/timeUnix",
          "user/birthdayUnix", "user/ageInSeconds", "user/profileName"], inplace=True)
beer.dropna(subset=["review/text"], inplace=True)

In [33]:
beer.isnull().sum()

beer/ABV                 0
beer/brewerId            0
beer/name                0
beer/style               0
review/appearance        0
review/aroma             0
review/overall           0
review/palate            0
review/taste             0
review/text              0
review/timeStruct        0
user/birthdayRaw     29634
user/gender          22182
dtype: int64

In [34]:
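def extract_year(dob):
    """Return the birth year from a string such as "Aug 10, 1976"."""
    if isinstance(dob, str):
        # The year is the last four characters of the raw date string.
        return np.int16(dob[-4:])
    # Missing values (NaN) pass through unchanged.
    return dob


beer["user/birthdayRaw"] = beer["user/birthdayRaw"].apply(extract_year)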
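
As an alternative to slicing the string, the year can also be extracted by parsing the date. The sketch below assumes every non-missing value follows the `"Aug 10, 1976"` pattern seen in the sample rows. The series `raw` is made up for illustration. Using the nullable `Int16` dtype is a design choice here: it keeps the years as integers even when some values are missing, whereas a plain NumPy integer column cannot hold NaN.

```python
import pandas as pd

# Illustrative values in the same format as "user/birthdayRaw".
raw = pd.Series(["Aug 10, 1976", None, "Jan 23, 1980"], name="user/birthdayRaw")

# Parse with an explicit format; missing entries become NaT.
parsed = pd.to_datetime(raw, format="%b %d, %Y")

# Nullable Int16 keeps whole-number years alongside missing values.
year = parsed.dt.year.astype("Int16")
print(year)
```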

In [35]:
beer.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37490 entries, 0 to 37499
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   beer/ABV           37490 non-null  float64
 1   beer/brewerId      37490 non-null  int64  
 2   beer/name          37490 non-null  object 
 3   beer/style         37490 non-null  object 
 4   review/appearance  37490 non-null  float64
 5   review/aroma       37490 non-null  float64
 6   review/overall     37490 non-null  float64
 7   review/palate      37490 non-null  float64
 8   review/taste       37490 non-null  float64
 9   review/text        37490 non-null  object 
 10  review/timeStruct  37490 non-null  object 
 11  user/birthdayRaw   7856 non-null   float64
 12  user/gender        15308 non-null  object 
dtypes: float64(7), int64(1), object(5)
memory usage: 4.0+ MB


In [36]:
cat_cols = ["beer/name", "beer/style", "user/gender"]
beer[cat_cols] = beer[cat_cols].astype("category")

beer.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37490 entries, 0 to 37499
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   beer/ABV           37490 non-null  float64 
 1   beer/brewerId      37490 non-null  int64   
 2   beer/name          37490 non-null  category
 3   beer/style         37490 non-null  category
 4   review/appearance  37490 non-null  float64 
 5   review/aroma       37490 non-null  float64 
 6   review/overall     37490 non-null  float64 
 7   review/palate      37490 non-null  float64 
 8   review/taste       37490 non-null  float64 
 9   review/text        37490 non-null  object  
 10  review/timeStruct  37490 non-null  object  
 11  user/birthdayRaw   7856 non-null   float64 
 12  user/gender        15308 non-null  category
dtypes: category(3), float64(7), int64(1), object(2)
memory usage: 3.4+ MB
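
One thing to keep in mind once `user/gender` is categorical: filling its missing values with a label that is not already a category, such as the `"Random"` placeholder named in the Define section, raises an error unless the label is registered first. A minimal sketch on a made-up series:

```python
import pandas as pd

# Illustrative categorical column with missing entries.
gender = pd.Series(["Male", None, "Female", None], dtype="category")

# Register the new label as a category, then use it as the fill value.
gender = gender.cat.add_categories("Random").fillna("Random")
print(gender)
```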


#### Percentage of missing data in "user/birthdayRaw" column

In [37]:
beer["user/birthdayRaw"].isna().sum() * 100 / beer.shape[0]

79.04507868765003
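
The same percentage can be computed for every column at once, which makes it easier to compare `user/birthdayRaw` with `user/gender`. The frame `toy` below is made up for illustration; in the notebook the expression would be applied to `beer`.

```python
import pandas as pd

# Illustrative frame with missing entries.
toy = pd.DataFrame({
    "user/birthdayRaw": [1976.0, None, None, None],
    "user/gender": ["Male", None, "Male", None],
})

# isna() yields booleans, so each column's mean is its missing fraction.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```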

#### Fill missing values with 0 instead of dropping the column, because birth year can be a factor for prediction

In [38]:
beer["user/birthdayRaw"] = beer["user/birthdayRaw"].fillna(0)


In [39]:
beer

Unnamed: 0,beer/ABV,beer/brewerId,beer/name,beer/style,review/appearance,review/aroma,review/overall,review/palate,review/taste,review/text,review/timeStruct,user/birthdayRaw,user/gender
0,5.00,14338,Chiostro,Herbed / Spiced Beer,4.0,4.0,4.0,4.0,4.0,Pours a clouded gold with a thin white head. N...,"{'min': 38, 'hour': 3, 'mday': 16, 'sec': 10, ...",0.0,
1,11.00,395,Bearded Pat's Barleywine,American Barleywine,4.0,3.5,3.5,3.5,3.0,12oz bottle into 8oz snifter.\t\tDeep ruby red...,"{'min': 38, 'hour': 23, 'mday': 8, 'sec': 58, ...",0.0,
2,4.70,365,Naughty Nellie's Ale,American Pale Ale (APA),3.5,4.0,3.5,3.5,3.5,First enjoyed at the brewpub about 2 years ago...,"{'min': 7, 'hour': 18, 'mday': 26, 'sec': 2, '...",0.0,Male
3,4.40,1,Pilsner Urquell,Czech Pilsener,3.0,3.0,2.5,3.0,3.0,First thing I noticed after pouring from green...,"{'min': 7, 'hour': 1, 'mday': 20, 'sec': 5, 'y...",1976.0,Male
4,4.40,1417,Black Sheep Ale (Special),English Pale Ale,4.0,3.0,3.0,3.5,2.5,A: pours an amber with a one finger head but o...,"{'min': 51, 'hour': 6, 'mday': 12, 'sec': 48, ...",0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
37495,5.50,3268,Blackberry Scottish-Style,Fruit / Vegetable Beer,4.0,3.5,3.5,3.5,3.5,12 oz brown longneck with no freshness dating....,"{'min': 56, 'hour': 23, 'mday': 10, 'sec': 1, ...",0.0,
37496,8.50,1199,Founders Dirty Bastard,Scotch Ale / Wee Heavy,4.5,4.0,3.5,4.5,4.5,A - A bright red with a maroon-amber hue; mini...,"{'min': 45, 'hour': 5, 'mday': 10, 'sec': 14, ...",0.0,
37497,4.75,394,Stoudt's Fest,Märzen / Oktoberfest,4.0,3.5,4.0,4.5,4.0,Sampled on tap at Redbones.\t\tThis marzen sty...,"{'min': 3, 'hour': 1, 'mday': 25, 'sec': 36, '...",0.0,
37498,11.20,1199,Founders KBS (Kentucky Breakfast Stout),American Double / Imperial Stout,4.0,4.0,4.0,5.0,5.0,Pours a black body with a brown head that very...,"{'min': 52, 'hour': 19, 'mday': 29, 'sec': 33,...",0.0,


In [40]:
beer["user/birthdayRaw"].nunique()

55

In [41]:
beer.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37490 entries, 0 to 37499
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   beer/ABV           37490 non-null  float64 
 1   beer/brewerId      37490 non-null  int64   
 2   beer/name          37490 non-null  category
 3   beer/style         37490 non-null  category
 4   review/appearance  37490 non-null  float64 
 5   review/aroma       37490 non-null  float64 
 6   review/overall     37490 non-null  float64 
 7   review/palate      37490 non-null  float64 
 8   review/taste       37490 non-null  float64 
 9   review/text        37490 non-null  object  
 10  review/timeStruct  37490 non-null  object  
 11  user/birthdayRaw   37490 non-null  float64 
 12  user/gender        15308 non-null  category
dtypes: category(3), float64(7), int64(1), object(2)
memory usage: 3.4+ MB


## Exporting Data for EDA

In [42]:
beer.to_csv("beer.csv", index=False)
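
Note that CSV stores only text, so the category dtypes set earlier are not preserved in `beer.csv`. If the next notebook needs them, they can be requested again when reading the file. The sketch below shows the round trip on a made-up `toy` frame, using an in-memory buffer in place of a file on disk.

```python
import io
import pandas as pd

# Illustrative frame with one categorical column.
toy = pd.DataFrame({
    "beer/style": pd.Categorical(["Old Ale", "American IPA", "Old Ale"]),
    "user/birthdayRaw": [0.0, 1976.0, 0.0],
})

# Write to an in-memory buffer instead of a file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read infers a text dtype for the style column.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing dtype restores the categorical column on load.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"beer/style": "category"})

print(plain.dtypes)
print(restored.dtypes)
```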

# Next, go to "`EDA.ipynb`"