<a href="https://colab.research.google.com/github/IYNESHDURAI/DiabetesRiskPrediction/blob/Flipkart/cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Data Cleaning </h1>

<h3>Loading data from given dataset</h3>

The dataset file must be saved as 'dataset.csv' in the working directory.   
We first load it into a pandas DataFrame.   
We read only the product category tree and description columns, since the model will be built from these two columns.

In [None]:
import pandas as pd
df = pd.read_csv("dataset.csv",usecols = ['product_category_tree','description'])
print(df)

                                   product_category_tree  \
0      ["Clothing >> Women's Clothing >> Lingerie, Sl...   
1      ["Furniture >> Living Room Furniture >> Sofa B...   
2      ["Footwear >> Women's Footwear >> Ballerinas >...   
3      ["Clothing >> Women's Clothing >> Lingerie, Sl...   
4      ["Pet Supplies >> Grooming >> Skin & Coat Care...   
...                                                  ...   
19995  ["Baby Care >> Baby & Kids Gifts >> Stickers >...   
19996  ["Baby Care >> Baby & Kids Gifts >> Stickers >...   
19997  ["Baby Care >> Baby & Kids Gifts >> Stickers >...   
19998  ["Baby Care >> Baby & Kids Gifts >> Stickers >...   
19999  ["Baby Care >> Baby & Kids Gifts >> Stickers >...   

                                             description  
0      Key Features of Alisha Solid Women's Cycling S...  
1      FabHomeDecor Fabric Double Sofa Bed (Finish Co...  
2      Key Features of AW Bellies Sandals Wedges Heel...  
3      Key Features of Alisha Solid Women's

<h3>Extracting categories from the product category tree </h3>

The dataset provides the full product category tree for every product: each category is divided into subcategories that lead down to the final product. Since the depth of the tree varies from product to product and every product has at least one main category (the root of the tree), we use this root category as the class for classification.

In [None]:
df2 = pd.read_csv("dataset.csv", usecols=['product_category_tree', 'description'])
# cat_map is a dictionary that stores every root category and its frequency
cat_map = dict()
for index in df2.index:
    x = df2.loc[index, 'product_category_tree']
    x = x.strip("[]\"")  # remove the surrounding brackets and quotes
    y = x.split(">>")  # split the tree into a list of categories
    z = y[0].strip().lower()  # root category, trimmed and lowercased
    if z in cat_map:
        cat_map[z] += 1
    else:
        cat_map[z] = 1
    df2.loc[index, 'product_category_tree'] = z
print(df2)

      product_category_tree                                        description
0                  clothing  Key Features of Alisha Solid Women's Cycling S...
1                 furniture  FabHomeDecor Fabric Double Sofa Bed (Finish Co...
2                  footwear  Key Features of AW Bellies Sandals Wedges Heel...
3                  clothing  Key Features of Alisha Solid Women's Cycling S...
4              pet supplies  Specifications of Sicons All Purpose Arnica Do...
...                     ...                                                ...
19995             baby care  Buy WallDesign Small Vinyl Sticker for Rs.730 ...
19996             baby care  Buy Wallmantra Large Vinyl Stickers Sticker fo...
19997             baby care  Buy Elite Collection Medium Acrylic Sticker fo...
19998             baby care  Buy Elite Collection Medium Acrylic Sticker fo...
19999             baby care  Buy Elite Collection Medium Acrylic Sticker fo...

[20000 rows x 2 columns]
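The same root-category extraction can also be written without an explicit loop. The following is a minimal sketch using the pandas string accessor; the three rows in `toy` are illustrative and are not taken from dataset.csv.

```python
import pandas as pd

# Illustrative rows in the same format as product_category_tree (not real data)
toy = pd.DataFrame({
    "product_category_tree": [
        '["Clothing >> Women\'s Clothing >> Lingerie"]',
        '["Furniture >> Living Room Furniture >> Sofa Beds"]',
        '["Clothing >> Men\'s Clothing"]',
    ]
})

root = (
    toy["product_category_tree"]
    .str.strip('[]"')   # drop the surrounding brackets and quotes
    .str.split(">>")    # split the tree into its levels
    .str[0]             # keep the root level only
    .str.strip()
    .str.lower()
)

# Frequency of every root category, equivalent in spirit to cat_map
counts = root.value_counts().to_dict()
print(root.tolist())
print(counts)
```

On 20000 rows the difference in speed is small, so the loop remains a reasonable choice; the vectorized form is mainly shorter.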


In [None]:
print(cat_map['clothing'])

6198


<h3> Dealing with categories that have extremely few data points</h3>

We want the model to predict as many categories as possible. However, many categories have only one or two data points, so we group every category with 10 or fewer data points under a single 'others' category.

In [None]:
for index in df2.index:
    x = df2.loc[index,'product_category_tree']
    if(cat_map[x]<=10):#categories with 10 or fewer points go to the others category
        df2.loc[index,'product_category_tree'] = 'others'
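As a sketch of the same idea in vectorized form, rare categories can be grouped by mapping each row to the frequency of its category. The helper name `club_rare`, its `min_count` parameter and the toy data are assumptions made for illustration only.

```python
import pandas as pd

def club_rare(categories: pd.Series, min_count: int) -> pd.Series:
    """Replace categories that occur min_count times or fewer with 'others'."""
    # Frequency of each row's own category
    freq = categories.map(categories.value_counts())
    # Keep the category where it is frequent enough, otherwise use 'others'
    return categories.where(freq > min_count, "others")

# Toy data: 'clothing' occurs 3 times, 'ebooks' once, 'gaming' twice
toy_cats = pd.Series(["clothing"] * 3 + ["ebooks"] + ["gaming"] * 2)
clubbed = club_rare(toy_cats, min_count=2)
print(clubbed.tolist())
```

With `min_count=10` this applies the same cutoff as the loop above.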

In [None]:
print(df2)

      product_category_tree                                        description
0                  clothing  Key Features of Alisha Solid Women's Cycling S...
1                 furniture  FabHomeDecor Fabric Double Sofa Bed (Finish Co...
2                  footwear  Key Features of AW Bellies Sandals Wedges Heel...
3                  clothing  Key Features of Alisha Solid Women's Cycling S...
4              pet supplies  Specifications of Sicons All Purpose Arnica Do...
...                     ...                                                ...
19995             baby care  Buy WallDesign Small Vinyl Sticker for Rs.730 ...
19996             baby care  Buy Wallmantra Large Vinyl Stickers Sticker fo...
19997             baby care  Buy Elite Collection Medium Acrylic Sticker fo...
19998             baby care  Buy Elite Collection Medium Acrylic Sticker fo...
19999             baby care  Buy Elite Collection Medium Acrylic Sticker fo...

[20000 rows x 2 columns]


<h3>Cleaning the description text</h3>

We apply some standard preprocessing steps to the description text so that the model can use it efficiently: removing special characters, accented characters (if any) and stopwords, and lemmatization.

<h4>Removing accented characters</h4>

In [None]:
import unicodedata
# function to remove accented characters
def remove_accented_chars(text):
    new_text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return new_text
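To see why this works, it helps to look at the intermediate result: NFKD normalization splits an accented letter into its base letter plus a combining mark, and the ASCII encoding step then drops the mark. A small sketch with an illustrative string:

```python
import unicodedata

text = "Caf\u00e9 d\u00e9cor"  # "Café décor" with precomposed accented letters

# NFKD splits each accented letter into base letter + combining accent
decomposed = unicodedata.normalize("NFKD", text)

# Encoding to ASCII with errors ignored drops the combining accents
ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")

print(len(text), len(decomposed))
print(ascii_only)
```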

<h4>Removing punctuation, special characters and numbers</h4>

In [None]:
# imports
import re
import string

# function to remove special characters
def remove_special_characters(text):
    # keep letters, digits, basic punctuation and whitespace; drop everything else
    pat = r'[^a-zA-Z0-9.,!?/:;\"\'\s]'
    return re.sub(pat, '', text)

# function to remove numbers
def remove_numbers(text):
    # same character class as above, but digits are no longer kept
    pattern = r'[^a-zA-Z.,!?/:;\"\'\s]'
    return re.sub(pattern, '', text)

# function to remove punctuation
def remove_punctuation(text):
    text = ''.join([c for c in text if c not in string.punctuation])
    return text
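A quick way to sanity-check a keep-pattern of this kind is to run it on a string containing brackets, underscores and backticks, which sit between `Z` and `a` in ASCII. The sketch below compares the ranges `A-z` and `A-Z` on an illustrative sample string.

```python
import re

sample = "size_[XL]^2 `new`"

# A-z spans ASCII 65..122, so it also keeps [ \ ] ^ _ and the backtick
wide = re.sub(r"[^a-zA-z\s]", "", sample)

# A-Z keeps uppercase letters only
strict = re.sub(r"[^a-zA-Z\s]", "", sample)

print(wide)
print(strict)
```

Only the `A-Z` form leaves letters and whitespace alone.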

<h4>Lemmatization</h4>

In [None]:
# imports
import nltk
import spacy
nlp = spacy.load('en_core_web_sm')
def get_lem(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

In [None]:
print(get_lem("we are eating and swimming ; we have been eating and swimming ; he eats and swims ; he ate and swam "))

we be eat and swimming ; we have be eat and swim ; he eat and swim ; he eat and swam


<h4>Removing Stopwords</h4>

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

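# Build the stopword set once; calling stopwords.words() for every token is very slow.
# With no argument, stopwords.words() returns the stopwords of every language NLTK provides.
STOPWORDS = set(stopwords.words())

def remove_stopwords(text):
    text_tokens = word_tokenize(text)
    tokens_without_sw = [word for word in text_tokens if word not in STOPWORDS]
    filtered_sentence = " ".join(tokens_without_sw)
    return filtered_sentence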

[nltk_data] Downloading package stopwords to /home/anant/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
remove_stopwords("Nick likes to play football, however he is not too fond of tennis.")

'Nick likes play football , however fond tennis .'

<h4>Removing unnecessary whitespaces and tabs </h4>


In [None]:
def remove_extra_whitespace_tabs(text):
    # collapse every run of whitespace (spaces, tabs, newlines) into a single space
    pattern = r'\s+'
    return re.sub(pattern, ' ', text).strip()
def to_lowercase(text):
    return text.lower()

In [None]:
print(df2)

      product_category_tree                                        description
0                  clothing  Key Features of Alisha Solid Women's Cycling S...
1                 furniture  FabHomeDecor Fabric Double Sofa Bed (Finish Co...
2                  footwear  Key Features of AW Bellies Sandals Wedges Heel...
3                  clothing  Key Features of Alisha Solid Women's Cycling S...
4              pet supplies  Specifications of Sicons All Purpose Arnica Do...
...                     ...                                                ...
19995             baby care  Buy WallDesign Small Vinyl Sticker for Rs.730 ...
19996             baby care  Buy Wallmantra Large Vinyl Stickers Sticker fo...
19997             baby care  Buy Elite Collection Medium Acrylic Sticker fo...
19998             baby care  Buy Elite Collection Medium Acrylic Sticker fo...
19999             baby care  Buy Elite Collection Medium Acrylic Sticker fo...

[20000 rows x 2 columns]


<h4>Before applying these functions to our data, we remove the examples that have no description at all</h4>

In [None]:
for index in df2.index:
    x = df2.loc[index,'description']
    # a missing description is loaded as NaN (a float), so it is not a str
    if not isinstance(x, str):
        print(x)
        print(index)


nan
553
nan
17299


<h4> Since there are only 2 such examples we remove them directly </h4>

In [None]:

df_new = df2.drop([553,17299])
print(df_new)

      product_category_tree                                        description
0                  clothing  Key Features of Alisha Solid Women's Cycling S...
1                 furniture  FabHomeDecor Fabric Double Sofa Bed (Finish Co...
2                  footwear  Key Features of AW Bellies Sandals Wedges Heel...
3                  clothing  Key Features of Alisha Solid Women's Cycling S...
4              pet supplies  Specifications of Sicons All Purpose Arnica Do...
...                     ...                                                ...
19995             baby care  Buy WallDesign Small Vinyl Sticker for Rs.730 ...
19996             baby care  Buy Wallmantra Large Vinyl Stickers Sticker fo...
19997             baby care  Buy Elite Collection Medium Acrylic Sticker fo...
19998             baby care  Buy Elite Collection Medium Acrylic Sticker fo...
19999             baby care  Buy Elite Collection Medium Acrylic Sticker fo...

[19998 rows x 2 columns]
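Hard-coding the two row indices works for this dataset. A more general sketch, assuming a toy frame with made-up descriptions, finds and drops missing descriptions with `isna` and `dropna`:

```python
import pandas as pd

# Toy frame with two missing descriptions (illustrative data)
toy = pd.DataFrame(
    {"description": ["sofa bed", float("nan"), "vinyl sticker", None]}
)

# Row labels whose description is missing
missing_idx = toy.index[toy["description"].isna()].tolist()

# Drop those rows without listing their labels by hand
cleaned = toy.dropna(subset=["description"])

print(missing_idx)
print(len(cleaned))
```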


<h4>Performing all cleaning operations on the description field</h4>

We iterate through the dataset and clean the description field for all examples

In [None]:
for index in df_new.index:
    x = df_new.loc[index,'description']
    x = to_lowercase(x)
    x = remove_extra_whitespace_tabs(x)
    x = remove_accented_chars(x)
    x = remove_special_characters(x)
    x = remove_numbers(x)
    x = remove_punctuation(x)
    x = remove_stopwords(x)
    df_new.loc[index,'description'] = x
print(df_new)
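The row-by-row loop can also be expressed with `Series.apply`, which calls one cleaning function on every description. The sketch below uses a simplified stand-in, `clean_text` (a hypothetical name), that only lowercases, replaces non-letters with spaces and collapses whitespace; it is not the full pipeline above.

```python
import re
import pandas as pd

def clean_text(text: str) -> str:
    """Simplified stand-in for the cleaning steps (assumption, not the full pipeline)."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # replace anything that is not a letter or whitespace
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

toy = pd.DataFrame(
    {"description": ["Key Features: 2 Sofa-Beds!", "  Buy   Vinyl\tSticker for Rs.730 "]}
)

# apply runs clean_text once per row and returns a new Series
toy["description"] = toy["description"].apply(clean_text)
print(toy["description"].tolist())
```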

In [None]:
df_parsed = df_new
for index in df_parsed.index:
    x = df_parsed.loc[index,'description']
    x = get_lem(x)
    df_parsed.loc[index,'description'] = x


Saving current progress

In [None]:
print(df_parsed)

      product_category_tree                                        description
0                  clothing  key feature alisha solid women cycling short c...
1                 furniture  fabhomedecor fabric double sofa bed finish col...
2                  footwear  key feature aw belly sandal wedge heel casuals...
3                  clothing  key feature alisha solid women cycling short c...
4              pet supplies  specification sicon purpose arnica shampoo ml ...
...                     ...                                                ...
19995             baby care  buy walldesign small vinyl sticker rs online w...
19996             baby care  buy wallmantra large vinyl sticker sticker rs ...
19997             baby care  buy elite collection medium acrylic sticker rs...
19998             baby care  buy elite collection medium acrylic sticker rs...
19999             baby care  buy elite collection medium acrylic sticker rs...

[19998 rows x 2 columns]


In [None]:
df_parsed.to_csv('out_parsed.csv',index = False)

<h3>Attaching labels to each product category</h3>

We assign each product category an integer label, as is usual in text classification tasks; this makes the data easier to work with.

We want a classifier that can predict as many categories as possible, but some categories still have few data points. We therefore create 3 csv files. In the first file, all categories with 10 or fewer data points are assigned to the 'others' category, which we have already done. For the second and third files we set this cutoff to 100 and 500 respectively. We will use the second and third files only if classification does not work well on the first, since one of our aims is for the classifier to predict a large number of categories.


In [None]:
label_list1 =  dict()
counter1 = 0
label_list2 =  dict()
counter2 = 0
label_list3 =  dict()
counter3 = 0

for k,v in cat_map.items():
    if(v>10):
        label_list1[k]=counter1
        counter1+= 1
    if(v>100):
        label_list2[k]=counter2
        counter2+= 1
    if(v>500):
        label_list3[k]=counter3
        counter3+= 1
label_list1['others'] = counter1
label_list2['others'] = counter2
label_list3['others'] = counter3

print(label_list1)
print(label_list2)
print(label_list3)

{'clothing': 0, 'furniture': 1, 'footwear': 2, 'pet supplies': 3, 'pens & stationery': 4, 'sports & fitness': 5, 'beauty and personal care': 6, 'bags, wallets & belts': 7, 'home decor & festive needs': 8, 'automotive': 9, 'tools & hardware': 10, 'home furnishing': 11, 'baby care': 12, 'mobiles & accessories': 13, 'watches': 14, 'toys & school supplies': 15, 'jewellery': 16, 'sunglasses': 17, 'kitchen & dining': 18, 'home & kitchen': 19, 'computers': 20, 'cameras & accessories': 21, 'health & personal care appliances': 22, 'gaming': 23, 'home improvement': 24, 'home entertainment': 25, 'ebooks': 26, 'others': 27}
{'clothing': 0, 'furniture': 1, 'footwear': 2, 'pens & stationery': 3, 'sports & fitness': 4, 'beauty and personal care': 5, 'bags, wallets & belts': 6, 'home decor & festive needs': 7, 'automotive': 8, 'tools & hardware': 9, 'home furnishing': 10, 'baby care': 11, 'mobiles & accessories': 12, 'watches': 13, 'toys & school supplies': 14, 'jewellery': 15, 'kitchen & dining': 16,
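The three label dictionaries follow the same rule with different cutoffs, so they can be produced by one helper. This is a sketch only: `build_label_map` is a hypothetical name, and the counts in `toy_counts` are illustrative apart from the 6198 clothing items printed earlier.

```python
def build_label_map(cat_counts, cutoff):
    """Give consecutive integer labels to categories with more than cutoff points."""
    labels = {}
    for name, count in cat_counts.items():
        if count > cutoff:
            labels[name] = len(labels)
    # Everything at or below the cutoff shares the final 'others' label
    labels["others"] = len(labels)
    return labels

# Illustrative frequencies standing in for cat_map
toy_counts = {"clothing": 6198, "furniture": 180, "ebooks": 15, "rare": 3}

maps = {cutoff: build_label_map(toy_counts, cutoff) for cutoff in (10, 100, 500)}
for cutoff, label_map in maps.items():
    print(cutoff, label_map)
```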

In [None]:
df_parsed = pd.read_csv('out_parsed.csv')

In [None]:
print(df_parsed)

      product_category_tree                                        description
0                  clothing  key feature alisha solid women cycling short c...
1                 furniture  fabhomedecor fabric double sofa bed finish col...
2                  footwear  key feature aw belly sandal wedge heel casuals...
3                  clothing  key feature alisha solid women cycling short c...
4              pet supplies  specification sicon purpose arnica shampoo ml ...
...                     ...                                                ...
19993             baby care  buy walldesign small vinyl sticker rs online w...
19994             baby care  buy wallmantra large vinyl sticker sticker rs ...
19995             baby care  buy elite collection medium acrylic sticker rs...
19996             baby care  buy elite collection medium acrylic sticker rs...
19997             baby care  buy elite collection medium acrylic sticker rs...

[19998 rows x 2 columns]


<h4>Assigning labels from the label list to their corresponding categories and saving the files</h4>

In [None]:
# copy so that adding the label column does not modify df_parsed
df_final1 = df_parsed.copy()
df_final1['label'] = df_final1['product_category_tree']
df_final1 = df_final1.replace({'label':label_list1})
print(df_final1)

      product_category_tree  \
0                  clothing   
1                 furniture   
2                  footwear   
3                  clothing   
4              pet supplies   
...                     ...   
19993             baby care   
19994             baby care   
19995             baby care   
19996             baby care   
19997             baby care   

                                             description  label  
0      key feature alisha solid women cycling short c...      0  
1      fabhomedecor fabric double sofa bed finish col...      1  
2      key feature aw belly sandal wedge heel casuals...      2  
3      key feature alisha solid women cycling short c...      0  
4      specification sicon purpose arnica shampoo ml ...      3  
...                                                  ...    ...  
19993  buy walldesign small vinyl sticker rs online w...     12  
19994  buy wallmantra large vinyl sticker sticker rs ...     12  
19995  buy elite collection mediu

In [None]:
df_final1.to_csv('outfinal1.csv',index=False)
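An alternative way to attach labels, shown here as a sketch on toy values, is to send every category missing from the label dictionary to 'others' and then look the labels up with `Series.map`. The dictionary and categories below are illustrative.

```python
import pandas as pd

# Illustrative label dictionary and category column
labels = {"clothing": 0, "furniture": 1, "others": 2}
cats = pd.Series(["clothing", "ebooks", "furniture"])

# Categories without their own label fall back to 'others'
grouped = cats.where(cats.isin(list(labels)), "others")

# map looks up each category in the dictionary
label_col = grouped.map(labels)
print(label_col.tolist())
```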

In [None]:
# copy so that applying the stricter cutoff does not modify df_parsed
df_final2 = df_parsed.copy()
for index in df_final2.index:
    x = df_final2.loc[index,'product_category_tree']
    if(x=='others'):
        continue
    if(cat_map[x]<=100):
        df_final2.loc[index,'product_category_tree'] = 'others'
df_final2['label'] = df_final2['product_category_tree']
df_final2 = df_final2.replace({'label':label_list2})

In [None]:
df_final2.to_csv('outfinal2.csv',index=False)

In [None]:
# copy so that applying the stricter cutoff does not modify df_parsed
df_final3 = df_parsed.copy()
for index in df_final3.index:
    x = df_final3.loc[index,'product_category_tree']
    if(x=='others'):
        continue
    if(cat_map[x]<=500):
        df_final3.loc[index,'product_category_tree'] = 'others'
df_final3['label'] = df_final3['product_category_tree']
df_final3 = df_final3.replace({'label':label_list3})
df_final3.to_csv('outfinal3.csv',index=False)
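One point to keep in mind when deriving several frames from the same parsed data: plain assignment of a DataFrame binds a second name to the same object, while `.copy()` creates an independent frame. A small sketch with toy data:

```python
import pandas as pd

base = pd.DataFrame({"product_category_tree": ["clothing", "ebooks"]})

alias = base                 # same object under a second name
alias["label"] = 0           # this also adds 'label' to base

independent = base.copy()    # separate object
independent["extra"] = 1     # base is not affected

print(list(base.columns))
print(alias is base, independent is base)
```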