### This notebook carries out the second part of Lab 1 in the Data Mining course, restructured to improve maintainability and readability.

## Table of Contents
1. Data Source
2. Data Preparation 
3. Data Transformation
 - Converting Dictionary into pandas DataFrame
4. Data Mining using Pandas
 - 4.1 Dealing with Missing Values
 - 4.2 Dealing with Duplicate Data
5. Data Preprocessing
 - 5.1 Sampling
 - 5.2 Feature Creation
 - 5.3 Feature Subset Selection
 - 5.4 Dimensionality Reduction
 - 5.5 Attribute Transformation / Aggregation
 - 5.6 Discretization and Binarization
6. Data Exploration
7. Conclusion


## 1. The Data
In this notebook we explore the Sentiment Labelled Sentences dataset, which can be found [here](https://archive.ics.uci.edu/dataset/331/sentiment+labelled+sentences). The dataset contains sentences labelled with positive or negative sentiment. Additional information about the dataset, taken from the website above, is reproduced below:

     This dataset was created for the Paper 'From Group to Individual Labels using Deep Features', Kotzias et. al,. KDD 2015
     Please cite the paper if you want to use it :

     It contains sentences labelled with positive or negative sentiment.

     # Format:

     sentence 	 score 

     # Details:

     Score is either 1 (for positive) or 0 (for negative)	
     The sentences come from three different websites/fields:

     imdb.com
     amazon.com
     yelp.com

     For each website, there exist 500 positive and 500 negative sentences. Those were selected randomly for larger datasets  of reviews. 
     We attempted to select sentences that have a clearly positive or negative connotaton, the goal was for no neutral sentences to be selected. 

## 2 - 3. Data Preparation & Data Transformation
In this section we load the data from the three text files into a single pandas DataFrame.
The DataFrame will ultimately have the format:

INDEX   |   sentence   |   Score   |   Source   |   Positive/Negative

In [65]:
import os
import pandas as pd
import sys
sys.path.append('../')
import helpers.data_mining_helpers as dmh

current_folder = os.getcwd()
# Change file_path to match where you put the txt files.
# The trailing "" makes os.path.join end the path with a separator.
file_path = os.path.join(current_folder, "sentiment+labelled+sentences", "sentimentlabelledsentences", "")

category = {0 : 'Negative', 1 : 'Positive'}
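
Hard-coded path separators tie a notebook to one operating system. A minimal sketch of a portable way to locate the files with `pathlib`, assuming the same folder names as in the cell above (the names `data_dir` and `yelp_file` are introduced here only for illustration):

```python
from pathlib import Path

# Assumption: the extracted archive sits in the current working directory,
# under the same folder names used in the cell above.
data_dir = Path.cwd() / "sentiment+labelled+sentences" / "sentimentlabelledsentences"

# Path objects can be passed directly to pd.read_csv or open().
yelp_file = data_dir / "yelp_labelled.txt"
print(yelp_file.name)
```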



In [66]:
yelp_df = pd.read_csv(file_path + "yelp_labelled.txt", names=['sentence', 'Score'], sep='\t')
yelp_df['Source'] = 'Yelp'
print("Yelp DF shape: " + str(yelp_df.shape))

Yelp DF shape: (1000, 3)


In [67]:
amazon_df = pd.read_csv(os.path.join(file_path, "amazon_cells_labelled.txt"), names=['sentence', 'Score'], sep='\t')
amazon_df['Source'] = 'Amazon'
print("Amazon DF shape: " + str(amazon_df.shape))

Amazon DF shape: (1000, 3)


In [68]:

imdb_df = pd.read_csv(os.path.join(file_path, "imdb_labelled.txt"), names=['sentence', 'Score'], sep='\t')
imdb_df['Source'] = 'IMDB'
print("IMDB DF shape: " + str(imdb_df.shape))
# Something goes wrong during the import of the IMDB records (only 748 rows are read),
# so the file is read with plain file operations instead and each line is split on the tab.
with open(file_path + "imdb_labelled.txt") as file:  # the with block closes the file
    imdb_raw_data = file.read().split('\n')

sentence = []
score = []
for line in imdb_raw_data[0:1000]:  # Use index 0:1000 to skip the trailing empty line.
    x = line.split('\t')
    sentence.append(x[0])
    score.append(int(x[-1]))  # Store the score as int, matching the other DataFrames.
imdb_df = pd.DataFrame({'sentence' : sentence, 'Score' : score})
imdb_df['Source'] = 'IMDB'
print("New IMDB DF shape: " + str(imdb_df.shape))


IMDB DF shape: (748, 3)
New IMDB DF shape: (1000, 3)
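
A likely cause of the 748-row result is that `pd.read_csv` treats a double quote at the start of a field as the opening of a quoted field, so everything up to the next quote, including tabs and line breaks, is merged into one sentence. The sketch below reproduces the effect on a small made-up sample (not the real file) and shows that passing `quoting=csv.QUOTE_NONE` keeps one row per line. Whether this alone recovers all 1000 IMDB rows is an assumption that should be checked against the actual file.

```python
import csv
import io

import pandas as pd

# Invented sample: two lines start with a double quote that is never closed.
raw = (
    '"Unclosed quote at the start of line one.\t1\n'
    'A plain line.\t0\n'
    '"Another line starting with a quote.\t1\n'
    'The last plain line.\t0\n'
)

# Default quoting: the parser merges lines between the two quote characters.
naive = pd.read_csv(io.StringIO(raw), names=['sentence', 'Score'], sep='\t')

# QUOTE_NONE: quote characters are treated as ordinary text.
fixed = pd.read_csv(io.StringIO(raw), names=['sentence', 'Score'], sep='\t',
                    quoting=csv.QUOTE_NONE)

print("Default quoting rows:", len(naive))
print("QUOTE_NONE rows:", len(fixed))
```

With `QUOTE_NONE` the quote characters stay in the sentence text, so they may need to be stripped later if they are unwanted.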


In [69]:

df = pd.concat([amazon_df, yelp_df, imdb_df], axis=0)
df

Unnamed: 0,sentence,Score,Source
0,So there is no way for me to plug it in here i...,0,Amazon
1,"Good case, Excellent value.",1,Amazon
2,Great for the jawbone.,1,Amazon
3,Tied to charger for conversations lasting more...,0,Amazon
4,The mic is great.,1,Amazon
...,...,...,...
995,I just got bored watching Jessice Lange take h...,0,IMDB
996,"Unfortunately, any virtue in this film's produ...",0,IMDB
997,"In a word, it is embarrassing.",0,IMDB
998,Exceptionally bad!,0,IMDB


In [70]:
len(df)

3000
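
The dataset description states that each source contributes 500 positive and 500 negative sentences. One way to verify this after combining the frames is to count scores per source. The sketch below uses a tiny stand-in DataFrame (the `toy` frame and its rows are invented for illustration) so that it runs on its own:

```python
import pandas as pd

# Invented stand-in for the combined DataFrame.
toy = pd.DataFrame({
    "sentence": ["a", "b", "c", "d"],
    "Score": [1, 0, 1, 0],
    "Source": ["Yelp", "Yelp", "IMDB", "IMDB"],
})

# Rows = source, columns = score value, cells = number of sentences.
counts = pd.crosstab(toy["Source"], toy["Score"])
print(counts)
```

On the real data, `pd.crosstab(df['Source'], df['Score'])` should show 500 in every cell if the import matches the description, provided `Score` has a consistent integer type across all three sources.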

In [71]:
df[0:2]

Unnamed: 0,sentence,Score,Source
0,So there is no way for me to plug it in here i...,0,Amazon
1,"Good case, Excellent value.",1,Amazon


In [72]:
for sentence in df['sentence'][:3]:
    print(sentence)

So there is no way for me to plug it in here in the US unless I go by a converter.
Good case, Excellent value.
Great for the jawbone.


In [73]:
#Add Positive/Negative column to the DF
df['Positive/Negative'] = df.Score.apply(lambda t: dmh.format_labels_lab2(t, category))
df[0:10]

Unnamed: 0,sentence,Score,Source,Positive/Negative
0,So there is no way for me to plug it in here i...,0,Amazon,Negative
1,"Good case, Excellent value.",1,Amazon,Positive
2,Great for the jawbone.,1,Amazon,Positive
3,Tied to charger for conversations lasting more...,0,Amazon,Negative
4,The mic is great.,1,Amazon,Positive
5,I have to jiggle the plug to get it to line up...,0,Amazon,Negative
6,If you have several dozen or several hundred c...,0,Amazon,Negative
7,If you are Razr owner...you must have this!,1,Amazon,Positive
8,"Needless to say, I wasted my money.",0,Amazon,Negative
9,What a waste of money and time!.,0,Amazon,Negative
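
The helper `dmh.format_labels_lab2` is defined outside this notebook and is not shown here. Assuming it only looks up each score in the `category` dictionary, the same column could be built with pandas' own `Series.map`, sketched below on a small invented frame:

```python
import pandas as pd

category = {0: "Negative", 1: "Positive"}

# Invented stand-in rows; the real DataFrame has the same two columns.
toy = pd.DataFrame({"sentence": ["Great!", "Awful."], "Score": [1, 0]})

# astype(int) guards against scores that were read in as strings.
toy["Positive/Negative"] = toy["Score"].astype(int).map(category)
print(toy)
```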


In [74]:
# a simple query
df[:10][["sentence","Positive/Negative"]]

Unnamed: 0,sentence,Positive/Negative
0,So there is no way for me to plug it in here i...,Negative
1,"Good case, Excellent value.",Positive
2,Great for the jawbone.,Positive
3,Tied to charger for conversations lasting more...,Negative
4,The mic is great.,Positive
5,I have to jiggle the plug to get it to line up...,Negative
6,If you have several dozen or several hundred c...,Negative
7,If you are Razr owner...you must have this!,Positive
8,"Needless to say, I wasted my money.",Negative
9,What a waste of money and time!.,Negative


In [75]:
#Last 10
df[-10:]

Unnamed: 0,sentence,Score,Source,Positive/Negative
990,"The opening sequence of this gem is a classic,...",1,IMDB,Positive
991,Fans of the genre will be in heaven.,1,IMDB,Positive
992,Lange had become a great actress.,1,IMDB,Positive
993,It looked like a wonderful story.,1,IMDB,Positive
994,I never walked out of a movie faster.,0,IMDB,Negative
995,I just got bored watching Jessice Lange take h...,0,IMDB,Negative
996,"Unfortunately, any virtue in this film's produ...",0,IMDB,Negative
997,"In a word, it is embarrassing.",0,IMDB,Negative
998,Exceptionally bad!,0,IMDB,Negative
999,All in all its an insult to one's intelligence...,0,IMDB,Negative


In [77]:
# Using loc (by label)
duplicates = df.duplicated()
print(sum(duplicates))
print(df[duplicates])
# df.duplicated() flagged some rows, but on closer inspection they do not
# appear to be true duplicates, so an extra 'index' column with a unique
# value per row is added to keep them distinct.
import numpy as np
df['index'] = np.arange(len(df))
duplicates = df.duplicated()
print(sum(duplicates))
# Note: the .loc call below still fails. pd.concat kept the row labels
# 0-999 of each source, so the DataFrame index is non-unique, and adding
# a column does not change the index.
df.loc[:10, 'sentence']



0
Empty DataFrame
Columns: [sentence, Score, Source, Positive/Negative, index]
Index: []
0


KeyError: 'Cannot get right slice bound for non-unique label: 10'
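
The `KeyError` arises because `pd.concat` keeps each source frame's original row labels, so the labels 0 to 999 each occur three times and `.loc[:10]` cannot decide which label 10 ends the slice. A minimal sketch of one way around this, using small invented frames in place of the real data: pass `ignore_index=True` to `pd.concat` (or call `reset_index(drop=True)` afterwards) to obtain a fresh, unique index.

```python
import pandas as pd

# Invented stand-ins for two of the source DataFrames.
a = pd.DataFrame({"sentence": ["a0", "a1", "a2"], "Source": "Amazon"})
b = pd.DataFrame({"sentence": ["y0", "y1", "y2"], "Source": "Yelp"})

# Plain concat keeps the labels 0, 1, 2 from both frames.
stacked = pd.concat([a, b], axis=0)
print(stacked.index.is_unique)

try:
    stacked.loc[:1, "sentence"]
except KeyError as err:
    # Same kind of error as in the cell above.
    print("KeyError:", err)

# ignore_index=True relabels the rows 0..n-1, so label slicing works.
fixed = pd.concat([a, b], ignore_index=True)
print(fixed.loc[:1, "sentence"].tolist())
```

Note that `.loc` slices are label-based and include the end label, so on a unique index `df.loc[:10, 'sentence']` returns 11 rows, whereas `df[0:10]` returns 10.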