# Data Cleaning

##### 1) Reading the CSV file:

`df = pd.read_csv(r"C:\Users\ISMAIL\Downloads\archive (10)\7817_1.csv")` reads the CSV file into a pandas DataFrame named `df`.

##### 2) Previewing the data:

`df.head()` displays the first five rows of the DataFrame to give an overview of the data.
##### 3) Handling missing values:

`df.dropna(subset=['reviews.text'], inplace=True)` drops the rows where the 'reviews.text' column is missing. Review text is treated as a crucial feature for the project, so rows without it are removed.

`df.fillna(value={'reviews.rating': 0}, inplace=True)` fills the missing values in the 'reviews.rating' column with 0. This treats a missing rating as a rating of zero, so 0 then marks reviews without a rating, which matters for any later average of the column.
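The missing-value handling described above can be tried on a small frame first. The sketch below is only an illustration: the `toy` DataFrame and the `handle_missing` helper are hypothetical names introduced here, not part of the notebook. It counts missing values per column, then applies the same `dropna` and `fillna` calls without modifying the input.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real reviews file.
toy = pd.DataFrame({
    "reviews.text": ["Great reader", None, "Screen is sharp"],
    "reviews.rating": [5.0, 4.0, np.nan],
})

def handle_missing(frame):
    """Drop rows without review text, then fill missing ratings with 0."""
    out = frame.dropna(subset=["reviews.text"])
    out = out.fillna(value={"reviews.rating": 0})
    return out

# Count missing values per column before cleaning.
missing_before = toy.isna().sum()

cleaned = handle_missing(toy)
print(missing_before)
print(cleaned)
```

Checking `isna().sum()` before dropping or filling shows how many rows each step will affect.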
##### 4) Cleaning the data:

`df['dateAdded'] = pd.to_datetime(df['dateAdded'])` converts the 'dateAdded' column to datetime format for easier manipulation and analysis.

`df['reviews.rating'] = df['reviews.rating'].astype(int)` converts the 'reviews.rating' column to integer type. This works only because the missing ratings were filled in step 3.
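The order of the type conversions matters, which a small example can show. The sketch below uses a hypothetical `toy` frame whose timestamp format matches the 'dateAdded' values in the notebook output. It illustrates that `astype(int)` fails while a rating is still missing and succeeds once the gap is filled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the notebook's actual file.
toy = pd.DataFrame({
    "dateAdded": ["2016-03-08T20:21:53Z", "2016-03-08T20:21:53Z"],
    "reviews.rating": [5.0, np.nan],
})

# The trailing 'Z' marks UTC, so the parsed column is timezone-aware.
toy["dateAdded"] = pd.to_datetime(toy["dateAdded"])

# An integer column cannot hold NaN, so casting before filling raises an error.
try:
    toy["reviews.rating"].astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# After filling the gap, the conversion succeeds.
toy["reviews.rating"] = toy["reviews.rating"].fillna(0).astype(int)

print(cast_failed)
print(toy.dtypes)
```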
##### Cleaning the 'reviews.text' column:

The `clean_text()` function cleans the review text by removing special characters and digits, converting the text to lowercase, tokenizing it, removing English stopwords, and joining the remaining tokens back into a string.

`df['reviews.text'] = df['reviews.text'].apply(clean_text)` applies `clean_text()` to each value in the 'reviews.text' column.
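To see what each cleaning step does to a single review, the following sketch mirrors the same sequence without NLTK. It is a simplification under stated assumptions: `clean_text_sketch` and `TOY_STOP_WORDS` are hypothetical names, `str.split()` stands in for `word_tokenize()`, and the stop-word list is a tiny hand-picked set rather than NLTK's English list.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook uses NLTK's full English list instead.
TOY_STOP_WORDS = {"i", "and", "them", "the", "a"}

def clean_text_sketch(text):
    """Mirror the cleaning steps with str.split() in place of word_tokenize()."""
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation and symbols
    text = re.sub(r"\d+", "", text)      # remove digits
    text = text.lower()                  # normalize case
    tokens = text.split()                # whitespace tokenization (a simplification)
    kept = [t for t in tokens if t not in TOY_STOP_WORDS]
    return " ".join(kept)

sample = "I bought 2 Kindles, and I LOVE them!"
result = clean_text_sketch(sample)
print(result)  # bought kindles love
```

Lowercasing before the stop-word filter matters here, because the stop-word set contains only lowercase entries.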

##### 5) Saving the cleaned dataset:

`df.to_csv('cleaned_dataset_AOPR.csv', index=False)` saves the cleaned DataFrame to a CSV file named 'cleaned_dataset_AOPR.csv' without writing the index as a column.

Overall, these steps handle missing values, convert data types, and clean the 'reviews.text' column to prepare the dataset for further analysis or modeling.
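The effect of `index=False` in the save step can be checked with an in-memory round trip. This is a minimal sketch: the `toy` frame and the `round_trip` helper are hypothetical, and `io.StringIO` replaces a real file so nothing is written to disk.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the cleaned reviews.
toy = pd.DataFrame({
    "reviews.rating": [5, 4],
    "reviews.text": ["great reader", "sharp screen"],
})

def round_trip(frame, index):
    """Write the frame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=index)
    buffer.seek(0)
    return pd.read_csv(buffer)

without_index = round_trip(toy, index=False)
with_index = round_trip(toy, index=True)

print(list(without_index.columns))
print(list(with_index.columns))
```

When the index is written, reading the file back yields an extra `Unnamed: 0` column. The same column name appears in the source file's preview, which may indicate it was saved with its index at some earlier point.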






In [1]:
import pandas as pd

In [3]:
df = pd.read_csv(r"C:\Users\ISMAIL\Downloads\archive (10)\7817_1.csv")  
df.head()


Unnamed: 0,id,asins,brand,categories,colors,dateAdded,dateUpdated,dimension,ean,keys,...,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username,sizes,upc,weight
0,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",,,Cristina M,,,205 grams
1,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,Allow me to preface this with a little history...,One Simply Could Not Ask For More,,,Ricky,,,205 grams
2,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,4.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,,,Tedd Gardiner,,,205 grams
3,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I bought one of the first Paperwhites and have...,Love / Hate relationship,,,Dougal,,,205 grams
4,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I have to say upfront - I don't like coroporat...,I LOVE IT,,,Miljan David Tanic,,,205 grams


In [9]:
# Handling missing values
df.dropna(subset=['reviews.text'], inplace=True)  # Dropping rows with missing review text


# Filling missing ratings with 0
df.fillna(value={'reviews.rating': 0}, inplace=True) 


In [13]:
# Drop columns that are not important to the project.
# errors='ignore' lets the cell be re-run after the columns are already gone.
columns_to_drop = ['asins', 'colors', 'dateUpdated', 'dimension', 'ean', 'keys', 'manufacturer',
                   'manufacturerNumber', 'prices', 'reviews.date', 'reviews.doRecommend',
                   'reviews.numHelpful', 'reviews.sourceURLs', 'reviews.userCity',
                   'reviews.userProvince', 'sizes', 'upc', 'weight']
df.drop(columns=columns_to_drop, inplace=True, errors='ignore')


In [14]:
df.head(10)

Unnamed: 0,id,brand,categories,dateAdded,name,reviews.rating,reviews.text,reviews.title,reviews.username
0,AVpe7AsMilAPnD_xQ78G,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,Kindle Paperwhite,5.0,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",Cristina M
1,AVpe7AsMilAPnD_xQ78G,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,Kindle Paperwhite,5.0,Allow me to preface this with a little history...,One Simply Could Not Ask For More,Ricky
2,AVpe7AsMilAPnD_xQ78G,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,Kindle Paperwhite,4.0,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,Tedd Gardiner
3,AVpe7AsMilAPnD_xQ78G,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,Kindle Paperwhite,5.0,I bought one of the first Paperwhites and have...,Love / Hate relationship,Dougal
4,AVpe7AsMilAPnD_xQ78G,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,Kindle Paperwhite,5.0,I have to say upfront - I don't like coroporat...,I LOVE IT,Miljan David Tanic
5,AVpe7AsMilAPnD_xQ78G,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,Kindle Paperwhite,0.0,"My previous kindle was a DX, this is my second...",Great device for reading. 8 people found this ...,Kelvin Law
6,AVpe7AsMilAPnD_xQ78G,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,Kindle Paperwhite,0.0,Allow me to preface this with a little history...,One Simply Could Not Ask For More 28 people fo...,Ricky
7,AVpe7AsMilAPnD_xQ78G,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,Kindle Paperwhite,0.0,Just got mine right now. Looks the same as the...,Definitely better than the previous generation...,Bandler
8,AVpe7AsMilAPnD_xQ78G,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,Kindle Paperwhite,0.0,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets! 16 people found...",Cristina M
9,AVpe7AsMilAPnD_xQ78G,Amazon,"Amazon Devices,mazon.co.uk",2016-03-08T20:21:53Z,Kindle Paperwhite,0.0,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader 19 ...,Tedd Gardiner


In [15]:
df['dateAdded'] = pd.to_datetime(df['dateAdded'])
df['reviews.rating'] = df['reviews.rating'].astype(int)


In [16]:
# Cleaning the reviews.text column
# Requires the NLTK stopwords and tokenizer resources, e.g. nltk.download('stopwords')
# and nltk.download('punkt'); resource names can vary by NLTK version.
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    """Strip punctuation and digits, lowercase, tokenize, and drop English stopwords."""
    # Remove special characters and digits
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)

    # Convert text to lowercase
    text = text.lower()

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords
    filtered_tokens = [token for token in tokens if token not in STOP_WORDS]

    # Join the filtered tokens back into a string
    return ' '.join(filtered_tokens)

df['reviews.text'] = df['reviews.text'].apply(clean_text)

In [17]:
# Saving as a CSV file
df.to_csv('cleaned_dataset_AOPR.csv', index=False)
