# Data cleaning challenge: which product do people like best?

In this challenge, you will take the role of a data scientist. You'll be given some data on customer reviews for 3 products (Products A, B, and C) and you'll have to clean it to be able to run your company's graphing code to see which product is best.

### Necessary files:
* There is a file in the `datasets` folder called `product_tests.csv`. This contains data from 100 customer ratings each of Products A, B, and C. Each customer has a unique user id and rated one of the products on a scale from 0 to 5 (0 is the worst, 5 is the best).
* There is a script that runs your company's graphing code called `compare_products.py`. This script will make a graph to help figure out which product customers like best. **This script reads in a file called 'products_clean.csv' in the datasets folder. Your overall job is to clean the data to make this file!**


**First, import the `product_tests.csv` file using pandas and assign it to a variable** (remember to import pandas too)

In [1]:
import pandas as pd 

# Use pandas to read the CSV file into a DataFrame

In [2]:
product_tests = pd.read_csv('../datasets/product_tests.csv')

In [3]:
product_tests

Unnamed: 0.1,Unnamed: 0,product,rating,user_id
0,0,Product A,4.340998,Y5JgC1
1,1,Product A,,GRHQYF
2,2,Product A,2.363216,EZ96Fa
3,3,Product A,,MzRCo4
4,4,Product A,4.987896,VnVWvM
...,...,...,...,...
295,95,Product C,4.332348,IkyryZ
296,96,Product C,4.531547,
297,97,Product C,3.733014,shIkm7
298,98,Product C,,4UFkhB
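Before cleaning, it can help to look at what was loaded. The sketch below uses a tiny made-up CSV (not the real `product_tests.csv`) and assumes that the `Unnamed: 0.1` and `Unnamed: 0` columns are leftover row indexes from earlier exports. Whether the graphing script needs them is not stated, so dropping them here is only for inspection.

```python
import io

import pandas as pd

# Made-up stand-in for product_tests.csv (hypothetical values).
raw_csv = io.StringIO(
    "Unnamed: 0.1,Unnamed: 0,product,rating,user_id\n"
    "0,0,Product A,4.3,Y5JgC1\n"
    "1,1,Product A,,GRHQYF\n"
    "2,0,Product B,2.4,\n"
)
sample = pd.read_csv(raw_csv)

# Assumption: columns whose name starts with 'Unnamed' are leftover indexes.
leftover = [col for col in sample.columns if col.startswith("Unnamed")]
tidy = sample.drop(columns=leftover)

# Count missing values per column to see what needs cleaning.
missing_counts = tidy.isnull().sum()
print(tidy.columns.tolist())
print(missing_counts)
```

Empty CSV fields are read in as missing values (`NaN`), which is why the per-column count is a quick way to spot the gaps.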


### Your data cleaning goals:

Your goal is to produce a cleaned data file called `products_clean.csv`. Here are the steps you should take to make sure the data are clean:

1. Remove any rows where ratings (values in the `rating` column) are below 0 or above 5. Such scores are impossible, so these rows should be removed.

# Keep only rows whose rating is between 0 and 5 (inclusive) and store the result in rating_removal

In [11]:
rating_removal = product_tests[product_tests['rating'] >= 0]
rating_removal = rating_removal[rating_removal['rating'] <= 5]

In [12]:
rating_removal

Unnamed: 0.1,Unnamed: 0,product,rating,user_id
0,0,Product A,4.340998,Y5JgC1
2,2,Product A,2.363216,EZ96Fa
4,4,Product A,4.987896,VnVWvM
5,5,Product A,0.256108,uyTYq1
6,6,Product A,0.254752,6hiPYk
...,...,...,...,...
294,94,Product C,2.183499,C3cTCd
295,95,Product C,4.332348,IkyryZ
296,96,Product C,4.531547,
297,97,Product C,3.733014,shIkm7
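The range filter in step 1 can also be written as a single mask. This is a minimal sketch on made-up ratings (the frame and its values are hypothetical), using `Series.between`, which includes both endpoints by default.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, including impossible scores and a missing value.
ratings = pd.DataFrame({
    "product": ["Product A", "Product A", "Product B", "Product C", "Product C"],
    "rating": [4.3, -1.0, 5.0, 7.2, np.nan],
})

# between() is inclusive on both ends by default, so 0 and 5 are kept.
in_range = ratings["rating"].between(0, 5)
valid = ratings[in_range]
print(valid)
```

Note that a missing rating compares as `False`, so rows without a rating are dropped by this filter as well.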


2. There are some rows where the user_id is missing. Replace these with the string 'unknown user' for each missing user_id. We don't know the user id, but maybe we can still analyze these data points!

# Replace every missing user_id with the string 'unknown user'
# assign() returns a new DataFrame, which avoids chained indexing and the SettingWithCopyWarning it triggers

In [14]:
rating_removal = rating_removal.assign(user_id=rating_removal['user_id'].fillna('unknown user'))
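Chained indexing such as `df['col'][mask] = value` can raise pandas' `SettingWithCopyWarning`, because pandas cannot tell whether the write reaches the original frame. One common way around it, sketched here on made-up data (the `reviews` frame is hypothetical), is to take an explicit copy and assign through `.loc` in a single step.

```python
import numpy as np
import pandas as pd

# Hypothetical reviews with one missing user_id.
reviews = pd.DataFrame({
    "rating": [4.5, 3.7, 2.1],
    "user_id": ["IkyryZ", np.nan, "shIkm7"],
})

# Take an explicit copy so the write does not target a slice of another frame.
filled = reviews.copy()
missing_id = filled["user_id"].isnull()

# A single .loc call selects rows and column together, so there is no chained indexing.
filled.loc[missing_id, "user_id"] = "unknown user"
print(filled)
```

`filled["user_id"].fillna("unknown user")` gives the same values. The `.loc` form is shown because it generalizes to other conditional replacements.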


3. Filter out any rows where `product` or `rating` are missing. We can't analyze data if we don't know which product it was, or what the rating was!

# Use .isnull() with ~ (logical NOT) to keep only rows where product and rating are present; .isnull().sum() then counts the missing values left in each column

In [17]:
rating_removal = rating_removal[~rating_removal['product'].isnull()]
rating_removal = rating_removal[~rating_removal['rating'].isnull()]

In [18]:
rating_removal.isnull().sum()

Unnamed: 0    0
product       0
rating        0
user_id       0
dtype: int64
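The same filtering can be done in one call with `DataFrame.dropna` and its `subset` parameter. A minimal sketch on made-up rows (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical reviews: one row lacks a product, another lacks a rating.
reviews = pd.DataFrame({
    "product": ["Product A", np.nan, "Product C", "Product C"],
    "rating": [4.3, 2.0, np.nan, 3.7],
    "user_id": ["Y5JgC1", "GRHQYF", "4UFkhB", "unknown user"],
})

# Drop a row if either of the listed columns is missing.
complete = reviews.dropna(subset=["product", "rating"])

# Confirm that no missing values remain in any column.
remaining_missing = complete.isnull().sum()
print(remaining_missing)
```

Limiting `subset` to these two columns leaves rows with a missing `user_id` alone, which matters if step 2 has not been run yet.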

4. Rename the `rating` column to `user_rating` and the `product` column to `product_id`. The company's code is built to use these standardized column names.

# Use .rename() with a dictionary that maps each old column name to its new name; inplace=True modifies rating_removal directly instead of returning a new DataFrame

In [21]:
rating_removal.rename(columns = {'rating':'user_rating', 'product':'product_id'}, inplace = True)
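A quick check after renaming can catch a typo in a column name before the graphing code does. The sketch below uses a one-row made-up frame. The set of expected names is taken from the steps in this challenge, and the script may require more than that.

```python
import pandas as pd

# Hypothetical one-row frame with the original column names.
reviews = pd.DataFrame({
    "product": ["Product A"],
    "rating": [4.3],
    "user_id": ["Y5JgC1"],
})

# Without inplace=True, rename() returns a new frame and leaves the original untouched.
renamed = reviews.rename(columns={"rating": "user_rating", "product": "product_id"})

# Compare against the column names the challenge asks for.
expected = {"product_id", "user_rating", "user_id"}
missing_columns = expected - set(renamed.columns)
print(missing_columns)
```

An empty set means every expected column is present.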

5. Once you've done all these steps, export the data to `jtc_class_code/datasets/products_clean.csv`

Make sure that the csv is named exactly this way in your folder, because the graphing code relies on this exact file path!

# Export the cleaned data to the datasets folder as products_clean.csv

In [22]:
rating_removal.to_csv('../datasets/products_clean.csv')
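By default `to_csv` also writes the row index as an unnamed first column, which is one way columns like `Unnamed: 0` end up in a file. The sketch below compares both behaviours using in-memory buffers instead of real files. Whether `compare_products.py` expects the index column is not stated, so treat `index=False` as an option to try rather than a requirement.

```python
import io

import pandas as pd

# Hypothetical cleaned frame whose index has gaps after filtering.
clean = pd.DataFrame(
    {"product_id": ["Product A", "Product C"], "user_rating": [4.3, 3.7]},
    index=[0, 297],
)

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
clean.to_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
clean.to_csv(without_index, index=False)

# Read both back to compare the resulting column names.
with_index.seek(0)
without_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_with)
print(cols_without)
```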

### Comparing the products

Once you've finished, run:
```console 
$ python compare_products.py
``` 

from the command line, and if the code runs smoothly, you'll see a file called `product_chart.png` pop up to help you decide which product customers like best. 

Which product do you think is highest-rated?

If you don't get it on the first try, don't worry! Use the error messages you see, and look at your `products_clean.csv` file to check what is being output. Both will help guide your data cleaning process.
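One way to inspect the output file is a small checker that lists what still looks wrong. This is a sketch only: `find_problems` is a hypothetical helper, and its rules come from the five cleaning steps above, not from what `compare_products.py` actually checks.

```python
import pandas as pd


def find_problems(df):
    """Return a list of human-readable problems found in a cleaned frame."""
    problems = []
    for col in ("product_id", "user_rating", "user_id"):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    if problems:
        # Without the expected columns the remaining checks cannot run.
        return problems
    if df[["product_id", "user_rating"]].isnull().any().any():
        problems.append("missing product_id or user_rating values")
    if not df["user_rating"].dropna().between(0, 5).all():
        problems.append("user_rating outside 0-5")
    if df["user_id"].isnull().any():
        problems.append("missing user_id values")
    return problems


# Hypothetical frames: one clean, one that skipped the renaming step.
good = pd.DataFrame(
    {"product_id": ["Product A"], "user_rating": [5.0], "user_id": ["unknown user"]}
)
bad = pd.DataFrame({"product": ["Product A"], "rating": [7.0], "user_id": [None]})

print(find_problems(good))
print(find_problems(bad))
```

To run it on your own output, load the file with `pd.read_csv('../datasets/products_clean.csv')` and pass the result to `find_problems`.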

## Finished and got the plot? Decided which product is highest-rated? 

#### Congrats on finishing the data cleaning challenge! Data cleaning is not easy! 

Remember to comment all your code and push this notebook to GitHub.