Pre-processing and reading in the files


In [138]:
import pandas as pd
import os
import re  # used by clean_text further down


In [2]:
workspace = r"D:\User\Documents\SMU CONTENT\Year 3 Sem 2\IS450\Project\Main\Exploration"
os.chdir(workspace)
path = r'TM_LDA_coherence200+.csv'
topicData = pd.read_csv(path)
df1 = topicData  # read_csv already returns a DataFrame, so no extra wrapping is needed

In [3]:
columns_interest = ['ProductId', 'Original', 'Text', 'product_category', 'Automated_topic_id']

In [4]:
df2 = df1[columns_interest].copy()

In [5]:
df2.head()

Unnamed: 0,ProductId,Original,Text,product_category,Automated_topic_id
0,B000LKU03G,This is my family's favorite brand of wheat fr...,This is my family's favorite brand of wheat fr...,'Cakes',quality
1,B000LKU03G,This is my family's favorite brand of wheat fr...,"This brand is moist, tasty and closer to the t...",'Cakes',taste
2,B000LKU03G,This is my family's favorite brand of wheat fr...,Namaste products are the best in my opinion of...,'Cakes',quality
3,B000LKU03G,This is my family's favorite brand of wheat fr...,Try it,'Cakes',quality
4,B000LKU03G,This is my family's favorite brand of wheat fr...,You won't be disappointed,'Cakes',taste


Clean the product_category column: the values are wrapped in extra quote marks (for example 'Cakes'), which we want to remove

In [6]:
def clean_product_cat(category):
    """
    Strip the surrounding quote marks (plus any leading spaces and trailing slashes) from a product category value
    """
    return category.rstrip("/'").lstrip(" '")

In [7]:
df2['product_category'] = df2['product_category'].apply(clean_product_cat)
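A quick check of what the two strip calls do. This is only a sketch: the sample strings below are made up for illustration and are not rows from the dataset.

```python
def clean_product_cat(category):
    """Strip stray quote marks, leading spaces and trailing slashes."""
    return category.rstrip("/'").lstrip(" '")

# hypothetical raw values, for illustration only
samples = ["'Cakes'", " 'Peanut Butter'", "'Snacks'/"]
cleaned = [clean_product_cat(s) for s in samples]
print(cleaned)
```

Note that rstrip and lstrip remove any run of the listed characters, so a category that genuinely begins or ends with an apostrophe would lose it as well.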

In [153]:
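def clean_text(text):
    """
    Remove HTML-style tags such as <br /> from the text
    """
    # matches '<', then any run of characters other than '>' or '.', then '>'
    pattern = r'<[^>.]*>'
    text = re.sub(pattern, '', text)
    return text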

In [156]:
df2['Text'] = df2['Text'].apply(clean_text)
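To see what the pattern removes, here is a small sketch on made-up strings (the inputs are assumptions, not taken from the data). Because the character class also excludes '.', a tag that contains a period, such as a link with a URL, is left in place.

```python
import re

def clean_text(text):
    # '<', any run of characters other than '>' or '.', then '>'
    pattern = r'<[^>.]*>'
    return re.sub(pattern, '', text)

# hypothetical inputs, for illustration only
simple = clean_text("Great taste.<br />Would buy again")
dotted = clean_text('See <a href="http://example.com">this</a>')
print(simple)
print(dotted)
```

If tags with periods should be stripped too, a pattern without the '.' in the character class would be one option, at the cost of matching longer spans.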

In [157]:
df2.head()

Unnamed: 0,ProductId,Original,Text,product_category,Automated_topic_id
0,B000LKU03G,This is my family's favorite brand of wheat fr...,This is my family's favorite brand of wheat fr...,Cakes,quality
1,B000LKU03G,This is my family's favorite brand of wheat fr...,"This brand is moist, tasty and closer to the t...",Cakes,taste
2,B000LKU03G,This is my family's favorite brand of wheat fr...,Namaste products are the best in my opinion of...,Cakes,quality
3,B000LKU03G,This is my family's favorite brand of wheat fr...,Try it,Cakes,quality
4,B000LKU03G,This is my family's favorite brand of wheat fr...,You won't be disappointed,Cakes,taste


Create a list of categories, sorted by number of rows (fewest first)

In [158]:
categoryLs = df2.groupby('product_category').count().sort_values(by = 'Text').index.tolist()

In [159]:
categoryLs[1]

'False Eyelashes & Adhesives'
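The ordering of the list can be checked on a toy frame. This is a sketch with invented rows; only the groupby/count/sort_values chain mirrors the cell above.

```python
import pandas as pd

# made-up rows, for illustration only
toy = pd.DataFrame({
    'product_category': ['Snacks', 'Cakes', 'Snacks', 'Tea', 'Snacks', 'Tea'],
    'Text': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# count rows per category, then sort ascending by that count
toy_categories = (toy.groupby('product_category').count()
                     .sort_values(by='Text').index.tolist())
print(toy_categories)
```

The category with the fewest rows comes first, so index 1 is the second-smallest category.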

Helper functions

In [219]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
import time

In [161]:
def create_df(data,product_cat = 'Canola'):
    """
    returns a copy of the rows of data whose product_category equals product_cat
    """
    df1 = data[data['product_category'] == product_cat].copy()
    return df1

In [220]:
def main(data):
    """
    Build one extractive summary per (category, topic) pair: score every
    sentence by the sum of its tf-idf weights and join the top four.
    """
    cv = CountVectorizer()
    rows = []  # one dict per (category, topic) pair

    for i, category in enumerate(categoryLs):
        if i > 160:
            time.sleep(10)  # pause before each of the later categories
        temp_df = create_df(data, category)
        topic_ls = list(temp_df['Automated_topic_id'].unique())
        for topic in topic_ls:
            temp_df2 = temp_df[temp_df['Automated_topic_id'] == topic].copy()
            text_ls = temp_df2['Text'].tolist()

            # word counts for the sentences in this category/topic
            word_count_vector = cv.fit_transform(text_ls)

            # tf-idf scores, one row per sentence
            tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
            tf_idf_vector = tfidf_transformer.fit_transform(word_count_vector)

            # score each sentence by the sum of its tf-idf weights
            test_ls = [x.todense().sum() for x in tf_idf_vector]
            test_df = pd.DataFrame(list(zip(text_ls, test_ls)), columns=["Text", "Weighted - TFIDF"]).sort_values(by=['Weighted - TFIDF'], ascending=False)
            summaries = '. '.join(test_df['Text'].tolist()[:4])  # keep the 4 highest-scoring sentences
            rows.append({'categoryID': category, 'topic': topic, 'summary': summaries, 'originalText': '. '.join(text_ls)})

    # build the result once from the collected rows
    finalDf = pd.DataFrame(rows, columns=['categoryID', 'topic', 'summary', 'originalText'])
    return finalDf


In [221]:
summary_df = main(df2)
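The scoring step inside main can be tried on its own. Below is a minimal sketch on three made-up sentences (assumptions, not rows from the dataset). With the default norm='l2', each tf-idf row has unit length, so the sum of its weights grows with the number of distinct words, and the ranking tends to favour longer sentences.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# made-up sentences, for illustration only
sentences = [
    "Delicious",
    "This brand is moist, tasty and close to the real thing",
    "Try it",
]

counts = CountVectorizer().fit_transform(sentences)
tfidf = TfidfTransformer(smooth_idf=True, use_idf=True).fit_transform(counts)

# one score per sentence: the sum of its tf-idf weights
scores = np.asarray(tfidf.sum(axis=1)).ravel()

# highest score first
ranked = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)
for sentence, score in ranked:
    print(f"{score:.3f}  {sentence}")
```

A one-word sentence scores 1, while a sentence with n distinct, equally weighted words scores the square root of n, which may explain why the summaries lean towards long sentences.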

In [222]:
summary_df

Unnamed: 0,categoryID,topic,summary,originalText
0,Sunflowers,Others,I figured that water and the provided plant fo...,"When the flowers showed up, they were in prett..."
1,False Eyelashes & Adhesives,taste,These truffles melt slowly in your mouth with ...,These truffles melt slowly in your mouth with ...
2,Pastry Shells & Crusts,Others,It's hard to find those things in grocery stor...,Was quite pleased with the product. Would buy ...
3,Pastry Shells & Crusts,quality,"The service was excellent, and the patty shell...",Arrived on time and wrapped well. The service ...
4,Basic Collars,taste,I have a Pit Bull who is a little over a year ...,I have a Pit Bull who is a little over a year ...
...,...,...,...,...
443,Snacks,taste,"So because they love them so much, these are n...","In the future, I'd be more apt to open up a 69..."
444,Fruit & Nut,taste,Ingredients(All ingredients are gluten-free)Or...,I would have liked to give these 2 and a half ...
445,Peanut Butter,taste,I am so excited to have finally found a way to...,"It's got the typical Jif taste, which to me is..."
446,Peanut Butter,quality,", but it takes more of those things that are n...",Good for dipping pretzels or celery/other vegg...


In [230]:
summary_df1 = summary_df.copy()

In [231]:
summary_df1['originalText'] = summary_df1['originalText'].apply(lambda x: x.replace("\'",''))

In [232]:
summary_df1['summary'] = summary_df1['summary'].apply(lambda x: x.replace("\'",''))

Looking at an example summary

In [235]:
summary_df1['summary'].iloc[120]

'I ordered this after a friend told me about all the great features & benefits and was so surprised at how great the coffee tasted, ease of use and the value added with all the additional parts. For those of us who cant stroll down to the local store and buy fresh ones (and I expect there are many of you out there), I think these are a good way to go. I was so excited to try a sweet hot sauce, my hubby and I are obsessed with buying new hot sauces, its actually the first thing we look to purchase when we are traveling. ( Did I mention coffee snob?)Having a way to make an easy,fast and excellent cup of coffee makes getting up in the morning something to really look forward to'

In [236]:
summary_df1['originalText'].iloc[120]

'The bouquet is evocative and tended to linger in the air yet was in no way overpowering. A fine impression remains when the cup is set down and the intake of breath recharges the sensation. Some brews become distinctly unpleasant when the leaves are introduced to certain mineral-rich well water or treated city water. However, for a proper moment of relaxation-- or an attempt to induce one in the midst of chaos. This hot sauce is delicious. I put it on chinese food, mexican food, macaroni & cheese, everything. If youre looking for something really tasty and with a kick (and youre bored of Tabasco, Texas Pete, and all those) then you should try some of this stuff. I was so excited to try a sweet hot sauce, my hubby and I are obsessed with buying new hot sauces, its actually the first thing we look to purchase when we are traveling. YAHOO. This is some of the finest hot sauce Ive had. I highly reccomend putting it on pizza, breakfast burritos, steaks, chicken, everything. Buy it, you wil

Write to file

In [237]:
summary_df1.to_csv('TF-IDF_summary.csv')
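By default to_csv also writes the row index, which comes back as an 'Unnamed: 0' column when the file is read again. A small sketch on a toy frame (the frame and the in-memory buffers are assumptions for illustration) shows the difference that index=False makes:

```python
import io
import pandas as pd

# toy frame, for illustration only
toy = pd.DataFrame({'categoryID': ['Cakes'], 'topic': ['taste']})

# default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whether to keep the index depends on whether downstream code relies on that first column.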