# Review Cleaner 

This is the last step. There are usually many *.csv* files containing raw reviews, because Review Scraper may stop partway or I may not have time to scrape everything at once. In such cases I save the .csv files with indexed names and then concatenate everything in this notebook.

In [1]:
import pandas as pd
import os
import re

In [2]:
review_files = os.listdir("your_reviews_folder")

# Raw Files

In [3]:
## Getting a list with the paths of all needed files.
## The prefix below must match the folder passed to os.listdir.

review_files = [("tv_reviews/" + i) for i in review_files]
review_files

['tv_reviews/tv_reviews_0_17_raw.csv',
 'tv_reviews/tv_reviews_18_raw.csv',
 'tv_reviews/tv_reviews_19_22raw.csv',
 'tv_reviews/tv_reviews_22_24raw.csv',
 'tv_reviews/tv_reviews_25_28raw.csv',
 'tv_reviews/tv_reviews_29raw.csv',
 'tv_reviews/tv_reviews_30_34raw.csv',
 'tv_reviews/tv_reviews_35_40raw.csv',
 'tv_reviews/tv_reviews_3raw.csv',
 'tv_reviews/tv_reviews_41_44raw.csv',
 'tv_reviews/tv_reviews_45_50raw.csv',
 'tv_reviews/tv_reviews_51_52raw.csv',
 'tv_reviews/tv_reviews_53raw.csv',
 'tv_reviews/tv_reviews_54_56raw.csv',
 'tv_reviews/tv_reviews_57_62raw.csv',
 'tv_reviews/tv_reviews_63_67raw.csv',
 'tv_reviews/tv_reviews_68_77raw.csv']
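
The path-building step can also be written so that the folder name appears only once. This is a minimal sketch, not the code used for the output above; `list_csv_paths` and the demo file names are made up for illustration.

```python
import os
import tempfile


def list_csv_paths(folder):
    """Return the sorted paths of the .csv files directly inside `folder`."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.endswith(".csv")
    )


# Demo on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["tv_reviews_18_raw.csv", "tv_reviews_0_17_raw.csv", "notes.txt"]:
        open(os.path.join(demo_dir, name), "w").close()
    demo_paths = list_csv_paths(demo_dir)
    demo_names = [os.path.basename(path) for path in demo_paths]
```

Filtering on the `.csv` suffix keeps stray files (notes, checkpoints) out of the list, and `os.path.join` avoids hard-coding the separator.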

In [4]:
## Concatenating all .csv files

all_reviews = pd.concat([pd.read_csv(i) for i in review_files])

In [5]:
all_reviews

Unnamed: 0,Rating,Title,Date,Helpful,Unhelpful,Review,Product_Name,Page
0,5,Small and powerful!,"Nov 22, 2021 11:58 AM",(20),(0),I bought this to have it in my office! and it’...,"Insignia™ - 32"" Class F20 Series LED HD Smart ...",1
1,5,Good brand,"Dec 28, 2019 5:44 PM",(157),(42),32 inch Insignia is a good quality tv. Ultra l...,"Insignia™ - 32"" Class F20 Series LED HD Smart ...",1
2,4,Good TV but not GREAT!,"Apr 20, 2021 2:40 PM",(24),(8),I am 95% happy so far with this tv aside from ...,"Insignia™ - 32"" Class F20 Series LED HD Smart ...",1
3,5,Tv,"Dec 15, 2020 7:12 PM",(39),(3),So far excellent quality and I love all the fe...,"Insignia™ - 32"" Class F20 Series LED HD Smart ...",1
4,5,great for a bedroom,"May 1, 2021 6:03 PM",(18),(5),I chose to buy the Insignia 32in fire tv for m...,"Insignia™ - 32"" Class F20 Series LED HD Smart ...",1
...,...,...,...,...,...,...,...,...
9724,3,Not getting Dolby Atmos.,"Mar 16, 2021 3:06 AM",(0),(0),Received a TV today. Happy with the picture qu...,"Sony - 55"" Class X800H Series LED 4K UHD Smart...",45
9725,1,Horrible picture quality,"Feb 21, 2021 12:38 AM",(0),(1),"I bought this to replace an 8 yr old 52"" Samsu...","Sony - 55"" Class X800H Series LED 4K UHD Smart...",45
9726,1,Sharp Picture,"Sep 21, 2020 5:21 AM",(0),(7),The picture on this tv is so sharp and the col...,"Sony - 55"" Class X800H Series LED 4K UHD Smart...",45
9727,2,Sound drops out. Consider another TV.,"Nov 28, 2020 2:43 AM",(0),(0),Picture looks nice. But the sound frequently d...,"Sony - 55"" Class X800H Series LED 4K UHD Smart...",46
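
By default `pd.concat` keeps each file's own index, so index values repeat across files. Below is a small sketch of the same concatenation on made-up in-memory data. The `Source_File` column is hypothetical and only there to show how each row could be traced back to its file.

```python
import io

import pandas as pd

# Two tiny in-memory "files" stand in for the raw .csv files (made-up data).
demo_files = {
    "part_0.csv": "Rating,Review\n5,Great\n4,Good\n",
    "part_1.csv": "Rating,Review\n1,Bad\n",
}

frames = []
for name, text in demo_files.items():
    frame = pd.read_csv(io.StringIO(text))
    frame["Source_File"] = name  # hypothetical helper column
    frames.append(frame)

# ignore_index=True builds a fresh 0..n-1 index instead of repeating
# each file's own index.
combined = pd.concat(frames, ignore_index=True)
```

In this notebook the same effect is reached later with `reset_index(drop=True)`, so `ignore_index=True` is just an alternative.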


In [7]:
## Dropping fully duplicated rows outright
all_reviews.drop_duplicates(inplace=True)

**Checking for thousands separators.**

In [9]:
all_reviews[all_reviews.Unhelpful.str.len() > 4]

Unnamed: 0,Rating,Title,Date,Helpful,Unhelpful,Review,Product_Name,Page
21252,5,Best camera for beginners,"Dec 29, 2019 1:12 AM",(24),(392),"My son loves the camera, He hasn't put it down...","Samsung - 40"" Class 5 Series LED Full HD Smart...",2
54542,1,Thanks for the terrific Christmas,"Dec 25, 2018 10:40 PM",(13),(101),"Screen was broken, took it out of the box and ...",Insignia™ - 55” Class LED 4K UHD Smart Fire TV...,3
59415,2,Not 50 inches,"Nov 6, 2020 9:46 PM",(6),(110),It’s says 50inch but we jst measured and it’s ...,"Insignia™ - 50"" Class F30 Series LED 4K UHD Sm...",4
18,4,Mounting screws for wall mount too long,"Apr 25, 2020 2:00 AM",(57),(131),I was excited to get this TV. I had purchased...,"Samsung - 50"" Class 7 Series LED 4K UHD Smart ...",1
8,1,Kid broke glass with one hit,"Jul 10, 2020 4:19 PM",(10),(232),I wish you had something to put over the scree...,"Samsung - 55"" Class Q60T Series QLED 4K UHD Sm...",1
3102,1,Don' buy,"Jun 13, 2021 7:31 PM",(16),(137),OLED55B6P\n\nHorrible burn in. It happen few m...,"LG - 55"" Class C1 Series OLED 4K UHD Smart web...",28
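
The length filter surfaces every count with three or more digits, so any value containing a thousands separator would show up in it. A rough sketch on made-up values shows the idea, plus a more direct check for commas. The comma is an assumption about how a separator would look; none appear in the rows above.

```python
import pandas as pd

# Made-up counts in the same "(n)" format as the scraped columns.
counts = pd.Series(["(0)", "(42)", "(392)", "(1,024)"])

# Same idea as the cell above: strings longer than 4 characters
# are counts with at least three digits.
long_values = counts[counts.str.len() > 4]

# A more direct check, assuming the separator is a comma.
has_separator = counts.str.contains(",", regex=False)

# If separators were present, they could be stripped before parsing.
cleaned = counts.str.replace(",", "", regex=False)
```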


In [10]:
## Extracting numbers inside of parentheses.
## Raw strings keep the backslashes in the regex from being read as string escapes.

all_reviews["Helpful"] = all_reviews.Helpful.str.extract(r"\((.+)\)")
all_reviews["Unhelpful"] = all_reviews.Unhelpful.str.extract(r"\((.+)\)")
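
To see what the extraction does, here is a small sketch on made-up values. It uses a stricter pattern (`\d+` instead of `.+`) and `expand=False`; both are my variations for illustration, not what the cell above runs.

```python
import pandas as pd

# Made-up values in the scraped "(n)" format.
helpful = pd.Series(["(20)", "(157)", "(0)"])

# The capture group keeps only the digits between the parentheses.
# expand=False returns a Series instead of a one-column DataFrame.
extracted = helpful.str.extract(r"\((\d+)\)", expand=False)
as_int = extracted.astype("int64")

# A value that does not match the pattern becomes NaN.
unmatched = pd.Series(["n/a"]).str.extract(r"\((\d+)\)", expand=False)
```

With `\d+`, a value such as `(1,024)` would not match and would show up as NaN, which is one way to make unexpected formats visible.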

In [12]:
## Parsing the numbers as integers. If this raises an exception, something went wrong.
## This is only a check: the converted frame is not assigned back.
## NA values in titles are OK.

all_reviews.astype({"Helpful" : "int64",
                   "Unhelpful" : "int64"}).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 201583 entries, 0 to 9728
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   Rating        201583 non-null  int64 
 1   Title         201544 non-null  object
 2   Date          201583 non-null  object
 3   Helpful       201583 non-null  int64 
 4   Unhelpful     201583 non-null  int64 
 5   Review        201583 non-null  object
 6   Product_Name  201583 non-null  object
 7   Page          201583 non-null  int64 
dtypes: int64(4), object(4)
memory usage: 13.8+ MB
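
If the integer check ever fails, `pd.to_numeric` with `errors="coerce"` is one way to find the offending rows instead of stopping at the exception. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up frame: one clean value, one with a separator, one missing.
demo = pd.DataFrame(
    {"Helpful": ["20", "1,024", None], "Title": ["a", "b", "c"]}
)

# Values that cannot be parsed become NaN instead of raising.
parsed = pd.to_numeric(demo["Helpful"], errors="coerce")

# Rows where parsing failed, for manual inspection.
bad_rows = demo[parsed.isna()]
```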


In [13]:
all_reviews.reset_index(drop=True, inplace=True)

In [17]:
## Here I examine duplicated reviews. Some products share reviews;
## in this case the duplicates should be removed. Whether to do so depends on the aim of your project.

all_reviews[all_reviews.duplicated(["Review"], keep=False)]

Unnamed: 0,Rating,Title,Date,Helpful,Unhelpful,Review,Product_Name,Page
319,5,Office TV,"Jan 29, 2022 10:56 PM",0,0,Great Tv lots of features and great looking c...,"Insignia™ - 32"" Class F20 Series LED HD Smart ...",16
375,5,"I really like this product, it does exactly wh...","Dec 27, 2021 6:21 PM",0,0,"I really like this product, it does exactly wh...","Insignia™ - 32"" Class F20 Series LED HD Smart ...",19
475,5,Works great in RV,"Aug 16, 2021 7:26 PM",0,0,We bought this TV to replace one in our RV. I...,"Insignia™ - 32"" Class F20 Series LED HD Smart ...",24
496,5,Great tv,"Dec 29, 2021 10:06 PM",0,0,My son loved his new TV. Will definitely recom...,"Insignia™ - 32"" Class F20 Series LED HD Smart ...",25
777,5,Great,"Oct 20, 2021 1:49 AM",0,0,"Great product so far, I really love using them...","Insignia™ - 32"" Class F20 Series LED HD Smart ...",39
...,...,...,...,...,...,...,...,...
199160,5,Great TV!,"Jun 25, 2021 8:35 PM",0,0,Great product for the cost. I would recommend ...,"Samsung - 43"" Class Q60A Series QLED 4K UHD Sm...",17
199720,5,TV,"Oct 9, 2021 8:11 PM",0,0,Great TV; easy to set up; very satisfied with ...,"Samsung - 43"" Class Q60A Series QLED 4K UHD Sm...",45
200252,5,Office TV,"Jan 29, 2022 10:54 PM",0,0,Great Tv lots of features and great looking c...,"Sony - 65"" Class X80J Series LED 4K UHD Smart ...",25
200540,5,Great,"Oct 30, 2021 1:52 AM",0,0,Can’t settle for less. Picture quality is exce...,"Sony - 65"" Class X80J Series LED 4K UHD Smart ...",39


In [18]:
all_reviews.drop_duplicates(["Review", "Page"], inplace=True)

In [19]:
all_reviews.drop_duplicates(["Review"], inplace=True)
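
The `keep` argument decides what "removing duplicates" means. A short sketch on made-up data compares the options: `keep=False` in `duplicated` marks every copy, while `drop_duplicates` with its default `keep="first"` keeps one copy of each review.

```python
import pandas as pd

# Made-up data: the same review text attached to three products.
demo = pd.DataFrame(
    {
        "Review": ["Great TV", "Great TV", "Too dim", "Great TV"],
        "Product_Name": ["A", "B", "A", "C"],
    }
)

# Every row whose review text appears more than once.
all_copies = demo[demo.duplicated(["Review"], keep=False)]

# Default behaviour: keep the first copy, drop the later ones.
keep_first = demo.drop_duplicates(["Review"])

# Alternative: drop every copy of a shared review.
drop_all = demo.drop_duplicates(["Review"], keep=False)
```

Which variant is right depends on the project: `keep_first` retains one instance of each shared review, `drop_all` discards shared reviews entirely.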

In [20]:
## Lastly, saving the clean file and checking that everything is right by re-reading it.

all_reviews.to_csv("all_tv_reviews_16_03_22.csv", index=False)

In [21]:
pd.read_csv("all_tv_reviews_16_03_22.csv").info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201113 entries, 0 to 201112
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   Rating        201113 non-null  int64 
 1   Title         201074 non-null  object
 2   Date          201113 non-null  object
 3   Helpful       201113 non-null  int64 
 4   Unhelpful     201113 non-null  int64 
 5   Review        201113 non-null  object
 6   Product_Name  201113 non-null  object
 7   Page          201113 non-null  int64 
dtypes: int64(4), object(4)
memory usage: 12.3+ MB
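
The re-read check can also be done programmatically instead of by eye. This is a rough sketch on a made-up frame, using an in-memory buffer in place of the real file:

```python
import io

import pandas as pd

# Made-up frame with one missing title, like the real data.
demo = pd.DataFrame(
    {"Rating": [5, 1], "Title": ["Good", None], "Helpful": [20, 0]}
)

# Write to an in-memory buffer and read it back.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Basic round-trip checks: same shape, same columns, same missing count.
same_shape = reloaded.shape == demo.shape
same_columns = list(reloaded.columns) == list(demo.columns)
missing_titles = int(reloaded["Title"].isna().sum())
```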
