# Transforming Data into Features

You are a data scientist at a clothing company, working with a dataset of customer reviews. The dataset is originally from [Kaggle](https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews) and can support a variety of machine learning tasks. Your job is to transform some of its features to make the data more useful for analysis. Along the way, you will practice the following:
- Transforming categorical data
- Scaling your data
- Working with date-time features

In [108]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

##  1.

- Let’s start with some basic exploration.
- First, import your dataset. It is stored in a file named `reviews.csv`.

In [109]:
df = pd.read_csv('reviews.csv')
df.head()

,clothing_id,age,review_title,review_text,recommended,division_name,department_name,review_date,rating
0,1095,39,"Cute,looks like a dress on",If you are afraid of the jumpsuit trend but li...,True,General,Dresses,2019-07-08,Liked it
1,1095,28,"So cute, great print!",I love fitted top dresses like this but i find...,True,General,Dresses,2019-05-17,Loved it
2,699,37,So flattering!,"I love these cozy, fashionable leggings. they ...",True,Initmates,Intimate,2019-06-24,Loved it
3,1072,36,Effortless,"Another reviewer said it best, ""i love the way...",True,General Petite,Dresses,2019-12-06,Loved it
4,1094,32,You need this!,Rompers are my fav so i'm biased writing this ...,True,General,Dresses,2019-10-04,Loved it


##  2.

- Next, we want to look at the column names of our dataset along with their data types. 
- Do the following two steps:
    - Print the column names of your dataset.
    - Check your features’ data types by printing `.info()`.

In [110]:
df.columns

Index(['clothing_id', 'age', 'review_title', 'review_text', 'recommended',
       'division_name', 'department_name', 'review_date', 'rating'],
      dtype='object')

In [111]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   clothing_id      5000 non-null   int64 
 1   age              5000 non-null   int64 
 2   review_title     4174 non-null   object
 3   review_text      4804 non-null   object
 4   recommended      5000 non-null   bool  
 5   division_name    4996 non-null   object
 6   department_name  4996 non-null   object
 7   review_date      5000 non-null   object
 8   rating           5000 non-null   object
dtypes: bool(1), int64(2), object(6)
memory usage: 317.5+ KB
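
The non-null counts above show that some columns have missing values. A minimal sketch of counting them directly with `.isna().sum()`, using a small made-up frame (`toy` and its values are illustrative assumptions, not rows from `reviews.csv`):

```python
import pandas as pd

# Hypothetical toy data, not taken from reviews.csv
toy = pd.DataFrame({
    "review_title": ["Effortless", None, "So flattering!"],
    "age": [36, 28, 37],
})

# Count the missing entries in each column
missing = toy.isna().sum()
print(missing)
```

Running the same call on `df` should report the gaps that `.info()` implies, for example in `review_title` and `review_text`.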


##  3.

- Transform the `recommended` feature. 
- Start by printing the feature’s `.value_counts()`.

In [112]:
df.recommended.value_counts()

recommended
True     4166
False     834
Name: count, dtype: int64

##  4.

- Since this is a True/False feature, we want to transform it to 1 for True and 0 for False.
- To do this, create a dictionary called `binary_dict` where:
    - The keys are what is currently in the `recommended` feature.
    - The values are what we want in the new column (0s and 1s).

In [113]:
binary_dict = {True: 1, False: 0}

##  5.

- Using `binary_dict`, transform the `recommended` column so that it will now be binary. 
- Print the results using `.value_counts()` to confirm the transformation.

In [114]:
df.recommended = df.recommended.map(binary_dict)
df.recommended.value_counts()

recommended
1    4166
0     834
Name: count, dtype: int64
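
One behavior of `.map()` with a dictionary is worth keeping in mind: any value that is not a key in the dictionary becomes `NaN`. The sketch below illustrates this on small made-up Series (`toy` and `partial` are hypothetical names, not part of the reviews data) and also shows `.astype(int)` as a possible alternative for boolean columns:

```python
import pandas as pd

# Hypothetical toy data, not taken from reviews.csv
toy = pd.Series([True, False, True])
binary_dict = {True: 1, False: 0}

mapped = toy.map(binary_dict)   # dictionary lookup for each element
casted = toy.astype(int)        # direct bool -> int cast gives the same values

# A value missing from the dictionary ("maybe") is mapped to NaN
partial = pd.Series(["yes", "no", "maybe"]).map({"yes": 1, "no": 0})

print(mapped.tolist(), casted.tolist(), partial.isna().tolist())
```

This is why confirming the result with `.value_counts()` is a useful habit: a typo in a dictionary key would silently produce missing values.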

##  6.

- Let’s run through a similar process to transform the `rating` feature. 
- This is ordinal data, so our transformation should make the ordering explicit.
- Again, start by printing the `.value_counts()`.

In [115]:
df.rating.value_counts()

rating
Loved it     2798
Liked it     1141
Was okay      564
Not great     304
Hated it      193
Name: count, dtype: int64

##  7.

- We want to make the following changes to the values:
    - ‘Loved it’ → 5
    - ‘Liked it’ → 4
    - ‘Was okay’ → 3
    - ‘Not great’ → 2
    - ‘Hated it’ → 1
- Create a dictionary called `rating_dict` where the keys are what is currently in the feature and the values are what we want in the new column. 
- You can use the hierarchy listed above to make your dictionary.

In [116]:
rating_dict = {"Loved it": 5, "Liked it": 4, "Was okay": 3, "Not great": 2, "Hated it": 1}

##  8.

- Using `rating_dict`, transform the rating column so it contains numerical values. 
- Print the results using `.value_counts()` to confirm the transformation.

In [117]:
df.rating = df.rating.map(rating_dict)
df.rating.value_counts()

rating
5    2798
4    1141
3     564
2     304
1     193
Name: count, dtype: int64
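
A dictionary is not the only way to encode ordinal data. As an alternative sketch (not the approach used in this project), an ordered `pd.Categorical` can carry the ranking, and its integer codes can be shifted onto the same 1 to 5 scale. The Series `toy` and the list `order` are illustrative assumptions:

```python
import pandas as pd

# Hypothetical toy ratings, not taken from reviews.csv
toy = pd.Series(["Loved it", "Was okay", "Hated it", "Liked it"])

# Categories listed from lowest to highest
order = ["Hated it", "Not great", "Was okay", "Liked it", "Loved it"]
cat = pd.Categorical(toy, categories=order, ordered=True)

# .codes start at 0, so add 1 to match the 1-5 scale used above
codes = pd.Series(cat.codes + 1)
print(codes.tolist())
```

Either approach gives the same numbers; the dictionary keeps the mapping visible in one line, while the ordered categorical also allows comparisons such as `cat > "Was okay"`.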

##  9.

- Let’s now transform the `department_name` feature. 
- This process will be slightly different, but start by printing the `.value_counts()` of the feature. After that, we will:
    1. Use pandas’ `get_dummies()` to one-hot encode the feature.
    2. Attach the results back to the original data frame.
    3. Print the column names to see what changed.

In [118]:
df.department_name.value_counts()

department_name
Tops        2196
Dresses     1322
Bottoms      848
Intimate     378
Jackets      224
Trend         28
Name: count, dtype: int64

##  10.

- Use pandas’ `get_dummies()` function to one-hot encode our feature. Assign the result to a variable called `one_hot`.

In [119]:
one_hot = pd.get_dummies(df.department_name)

##  11.

- Join the results from `one_hot` back to our original data frame. 
- Then display the first rows so that the column names are visible. 
- What has been added?

In [120]:
df = df.join(one_hot)
df.head()

,clothing_id,age,review_title,review_text,recommended,division_name,department_name,review_date,rating,Bottoms,Dresses,Intimate,Jackets,Tops,Trend
0,1095,39,"Cute,looks like a dress on",If you are afraid of the jumpsuit trend but li...,1,General,Dresses,2019-07-08,4,False,True,False,False,False,False
1,1095,28,"So cute, great print!",I love fitted top dresses like this but i find...,1,General,Dresses,2019-05-17,5,False,True,False,False,False,False
2,699,37,So flattering!,"I love these cozy, fashionable leggings. they ...",1,Initmates,Intimate,2019-06-24,5,False,False,True,False,False,False
3,1072,36,Effortless,"Another reviewer said it best, ""i love the way...",1,General Petite,Dresses,2019-12-06,5,False,True,False,False,False,False
4,1094,32,You need this!,Rompers are my fav so i'm biased writing this ...,1,General,Dresses,2019-10-04,5,False,True,False,False,False,False
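
Recall from `.info()` that `department_name` has a few missing values. By default, `get_dummies()` gives such rows a 0 (or `False`) in every dummy column. The sketch below shows this on a made-up frame (`toy` is a hypothetical name) and uses the `dtype=int` argument to get 0/1 integers instead of booleans:

```python
import pandas as pd

# Hypothetical toy data with one missing department, not taken from reviews.csv
toy = pd.DataFrame({"department_name": ["Tops", "Dresses", None, "Tops"]})

# dtype=int yields 0/1 columns rather than True/False
one_hot = pd.get_dummies(toy["department_name"], dtype=int)

# Join on the shared index, as done above
toy = toy.join(one_hot)
print(toy)
```

If missing departments should be tracked explicitly, `get_dummies()` also accepts `dummy_na=True`, which adds a separate indicator column for them.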


##  12.

- Transform the `review_date` feature.
- This feature is listed as an object type, but we want this to be transformed into a date-time feature.
    - Transform `review_date` into a date-time feature.
    - Print the feature type to confirm the transformation.

In [121]:
df.review_date = pd.to_datetime(df.review_date, utc=True)
df.review_date.dtype

datetime64[ns, UTC]
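
Once a column has a date-time type, the `.dt` accessor can derive new features from it. The following is a small sketch using the first three dates shown earlier; the column names `review_month` and `review_weekday` are illustrative choices, not part of the original project:

```python
import pandas as pd

# Dates copied from the first rows displayed earlier
toy = pd.DataFrame({"review_date": ["2019-07-08", "2019-05-17", "2019-06-24"]})
toy["review_date"] = pd.to_datetime(toy["review_date"], utc=True)

# Derive a numeric month and a weekday name from each date
toy["review_month"] = toy["review_date"].dt.month
toy["review_weekday"] = toy["review_date"].dt.day_name()
print(toy)
```

Features like these can be useful when review behavior is expected to vary by season or by day of the week.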

##  13.

- The final step we will take in our transformation project is scaling our data. 
- We notice that we have a wide range of numbers thus far, so it is best to put everything on the same scale.
- Let’s reduce our data frame to only the numerical features: the original numeric columns plus the ones we created. 

In [122]:
df = df[['clothing_id', 'age', 'recommended', 'rating', 'Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']].copy()

##  14.

- Set the index to our `clothing_id` feature, so that the ID is kept as a label and is not scaled along with the other columns.

In [123]:
df = df.set_index('clothing_id')

##  15.

- Perform a `.fit_transform()` on our data set, and print the results to see how the features have changed.

In [124]:
scaler = StandardScaler()
scaled_arr = scaler.fit_transform(df)

scaled_df = pd.DataFrame(scaled_arr, columns=df.columns, index=df.index)
scaled_df.head()

clothing_id,age,recommended,rating,Bottoms,Dresses,Intimate,Jackets,Tops,Trend
1095,-0.348145,0.447428,-0.189648,-0.451928,1.667977,-0.285977,-0.216567,-0.884967,-0.075044
1095,-1.244752,0.447428,0.716025,-0.451928,1.667977,-0.285977,-0.216567,-0.884967,-0.075044
699,-0.511164,0.447428,0.716025,-0.451928,-0.599529,3.496786,-0.216567,-0.884967,-0.075044
1072,-0.592674,0.447428,0.716025,-0.451928,1.667977,-0.285977,-0.216567,-0.884967,-0.075044
1094,-0.918713,0.447428,0.716025,-0.451928,1.667977,-0.285977,-0.216567,-0.884967,-0.075044


In [127]:
for col in df.columns:
    print(f'{col} mean: {np.round(scaled_df[col].mean(), 2)}')
    print(f'{col} std: {np.round(scaled_df[col].std(), 2)}')

age mean: -0.0
age std: 1.0
recommended mean: -0.0
recommended std: 1.0
rating mean: 0.0
rating std: 1.0
Bottoms mean: 0.0
Bottoms std: 1.0
Dresses mean: -0.0
Dresses std: 1.0
Intimate mean: -0.0
Intimate std: 1.0
Jackets mean: 0.0
Jackets std: 1.0
Tops mean: 0.0
Tops std: 1.0
Trend mean: -0.0
Trend std: 1.0
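
For intuition, standardization subtracts each column's mean and divides by its standard deviation. `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample version (`ddof=1`). With 5,000 rows the difference disappears after rounding, which is why the check above prints 1.0. A minimal sketch of the same calculation by hand, on the five ages from the first rows (`ages` and `z` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Ages from the first five rows displayed earlier
ages = pd.Series([39, 28, 37, 36, 32], dtype=float)

# z-score with the population standard deviation (ddof=0)
z = (ages - ages.mean()) / ages.std(ddof=0)

# The mean is ~0; the std is 1 with ddof=0 but slightly larger with pandas' default ddof=1
print(round(z.mean(), 2), round(z.std(ddof=0), 2), round(z.std(), 2))
```

On a sample this small the two definitions of standard deviation differ visibly, which makes the distinction easier to see than on the full dataset.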
