### Note on large .csv files ###
The Reddit Climate Change Dataset (pavellexyr) and Sentiment140 (kazanova) are very large files, each with over 1 million observations. I cannot upload them to GitHub, so the Kaggle links for each are listed below. One further dataset, Twitter US Airline Sentiment (crowdflower), is mentioned but has no associated code; its link is included in case it is needed in the future.

The Reddit Climate Change Dataset (pavellexyr): https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset

Sentiment140 dataset with 1.6 million tweets (kazanova): https://www.kaggle.com/datasets/kazanova/sentiment140

Twitter US Airline Sentiment (crowdflower): https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment

In [1]:
import numpy as np
import pandas as pd

In [2]:
#without this limit the kernel crashes. Reads at most row_count * max_obv = 100K observations per dataset
row_count = 1000
max_obv = 100

In [4]:
pip install langdetect


In [5]:
#https://pypi.org/project/langdetect/

### uncomment this to install, then comment and restart kernel ###
# %%capture
# !pip install langdetect

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException
### regex for more lang checking
import re

### (pavellexyr) The Reddit Climate Change Dataset ###
Only the-reddit-climate-change-dataset-comments.csv is used, since it has a sentiment column for the text. This file alone is over 4GB and processing all of it breaks the kernel, so only the first 100K rows are processed. The columns of interest are 'body' (the comment's body text) and 'sentiment' (the analyzed sentiment for that text). NaN values are dropped, leaving a little under 100K usable observations.

This is the only dataset with continuous sentiment values in [-1, 1] rather than discrete values in {-1, 0, 1}. If ALL of these datasets are combined into one, the pooled sentiment values will therefore cluster at the discrete values, with comparatively few observations in between.
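To put the continuous scores on the same three-way scale as the other datasets, one option is to map each score by its sign. The sketch below is illustrative only: `to_label` and its `neutral_band` parameter are hypothetical names, and the band (default 0, so only an exact 0 is neutral) is an assumption rather than something the dataset prescribes.

```python
import math


def to_label(score, neutral_band=0.0):
    """Map a continuous sentiment score in [-1, 1] to a three-way label.

    neutral_band is a hypothetical tolerance: scores whose absolute value
    is at or below it count as neutral. Missing values return None.
    """
    if score is None or (isinstance(score, float) and math.isnan(score)):
        return None
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


# Made-up scores, one per class.
labels = [to_label(s) for s in (-0.8, 0.0, 0.3)]
```

Widening `neutral_band` would move weakly scored comments into the neutral class, which changes the class balance, so any non-zero value should be chosen deliberately.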

In [51]:
rclimate_text = []
rclimate_sentiment = []
i = 0
for chunk in pd.read_csv('the-reddit-climate-change-dataset-comments.csv', chunksize=row_count):
    if i < max_obv:
        rclimate_text += chunk['body'].tolist()
        rclimate_sentiment += chunk['sentiment'].tolist()
        i += 1
    else:
        break
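The loop above stops after `max_obv` chunks of `row_count` rows each. A shorter way to get a similar cap is to pass `nrows` and `usecols` to `pd.read_csv`, sketched here on a made-up in-memory CSV. `read_first_rows` is a hypothetical helper, not part of this notebook, and unlike the chunked loop it loads the capped rows in a single call.

```python
import io

import pandas as pd


def read_first_rows(source, columns, max_rows):
    """Read at most max_rows rows of the given columns, then drop NaN rows."""
    df = pd.read_csv(source, usecols=columns, nrows=max_rows)
    return df.dropna()


# Tiny in-memory stand-in for the real CSV file (rows are made up).
sample = io.StringIO(
    "id,body,sentiment\n"
    "1,first comment,0.5\n"
    "2,,0.1\n"
    "3,third comment,-0.2\n"
    "4,fourth comment,0.9\n"
)

# Only the first three data rows are read; the row with no body is dropped.
subset = read_first_rows(sample, ["body", "sentiment"], max_rows=3)
```

Restricting `usecols` to the two columns of interest also avoids parsing the columns that are never used.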

In [52]:
rclimate_df = pd.DataFrame(data={'text': rclimate_text, 'sentiment': rclimate_sentiment})
rclimate_df = rclimate_df.dropna()
# rclimate_df.shape #expected (98422, 2)

In [44]:
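def validate_line(line):
    # Empty or whitespace-only text is treated as missing.
    if not line or not line.strip():
        return np.nan

    # Only run language detection when the text contains a letter.
    # ('.' does not match newlines, so this inspects the first line only.)
    if re.match('^(?=.*[a-zA-Z])', line):
        try:
            if detect(line) != 'en':
                return np.nan
        except LangDetectException:
            # Raised when langdetect finds no usable features in the text.
            return np.nan
    return True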
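One detail of the letter check in `validate_line` is easy to miss: with `re.match` and the pattern `^(?=.*[a-zA-Z])`, the `.` does not match a newline by default, so only the first line of a multi-line text is inspected. The sketch below compares that pattern with a plain `re.search` for any letter. The function names are hypothetical, and the second pattern is an assumed alternative, not what the notebook uses.

```python
import re

FIRST_LINE_PATTERN = re.compile(r'^(?=.*[a-zA-Z])')  # pattern used in validate_line
ANYWHERE_PATTERN = re.compile(r'[a-zA-Z]')           # assumed alternative


def has_letter_first_line(text):
    """True if the first line of text contains an ASCII letter."""
    return bool(FIRST_LINE_PATTERN.match(text))


def has_letter_anywhere(text):
    """True if any line of text contains an ASCII letter."""
    return bool(ANYWHERE_PATTERN.search(text))


# A text whose first line has no letters but whose second line does.
multiline = "12345\nhello world"
first_line_result = has_letter_first_line(multiline)
anywhere_result = has_letter_anywhere(multiline)
```

With the first pattern, a text like `multiline` skips language detection and is kept as if it were English, so switching patterns would change the filtered row counts.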

In [42]:
### expect 10 minutes to run for 100k rows
### text_col is df['text']
### sentiment_col is df['sentiment']
### returns three lists of the same length

def check_en(text_col, sentiment_col):
    en_text = text_col.tolist()
    en_sentiment = sentiment_col.tolist()
    lang = []
    
    start = 0
    for i in np.arange(row_count, len(en_text), row_count):
        #observations <1000 at the end will be lost but impact is negligible
        #!!!uncomment print statement below to show progress (recommended)!!!
#         print(start, i)
        lang += [validate_line(x) for x in en_text[start:i]]
        start = i
    print("Finished English check")
    ### all three return values should be of the same length
    return en_text[0:len(lang)], en_sentiment[0:len(lang)], lang
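The batching loop in `check_en` walks the stops `row_count, 2*row_count, ...` and excludes the total length itself, which is why the returned lists are shorter than the input. The sketch below reproduces that arithmetic with plain `range` (equivalent to `np.arange` for these integer arguments) and shows an assumed alternative that covers every row. Both function names are hypothetical.

```python
def rows_checked(n_rows, batch_size=1000):
    """Number of rows covered by a loop shaped like the one in check_en."""
    covered = 0
    for stop in range(batch_size, n_rows, batch_size):
        covered = stop  # each pass covers rows [previous stop, stop)
    return covered


def full_batches(n_rows, batch_size=1000):
    """Assumed alternative: (start, stop) pairs that cover every row."""
    return [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]


climate_rows = rows_checked(98422)      # 98000
exact_multiple = rows_checked(100000)   # 99000, a full batch is dropped
batches = full_batches(2500)            # [(0, 1000), (1000, 2000), (2000, 2500)]
```

Because the stop value is exclusive, an input whose length is an exact multiple of the batch size loses its entire last batch, not just a partial tail.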

In [53]:
en_climate_text, en_climate_sentiment, en_climate_lang = check_en(rclimate_df['text'], rclimate_df['sentiment'])
#all should be len 98000

KeyboardInterrupt: 

In [None]:
en_climate_df = pd.DataFrame(data={'text': en_climate_text, 'sentiment': en_climate_sentiment, 'english': en_climate_lang})
en_climate_df = en_climate_df.dropna()
en_climate_df = en_climate_df.drop(columns=['english'])
# print(en_climate_df.shape) #expected (97759, 2)

In [None]:
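def sentiment_to_string(sentiment):
    # Numeric scores (including NumPy scalars) are mapped by sign;
    # anything else, such as an existing string label, is returned unchanged.
    if isinstance(sentiment, (int, float, np.integer, np.floating)):
        if sentiment < 0:
            return "negative"
        if sentiment > 0:
            return "positive"
        return "neutral"
    else:
        return sentiment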

In [None]:
en_climate_df['sentiment'] = en_climate_df['sentiment'].apply(sentiment_to_string)
# en_climate_df

### (cosmos98) Twitter and Reddit Sentimental analysis Dataset ###
As with the dataset above, observations are limited to the first 100K rows and NaN values are dropped. Numeric values of {-1, 0, 1} are changed to {negative, neutral, positive} respectively.

In [6]:
cosmos_twitter_text = []
cosmos_twitter_sentiment = []
i = 0
for chunk in pd.read_csv('Twitter_Data.csv', chunksize=row_count):
    if i < max_obv:
        cosmos_twitter_text += chunk['clean_text'].tolist()
        cosmos_twitter_sentiment += chunk['category'].tolist()
        i += 1
    else:
        break

cosmos_reddit_text = []
cosmos_reddit_sentiment = []
i = 0
for chunk in pd.read_csv('Reddit_Data.csv', chunksize=row_count):
    if i < max_obv:
        cosmos_reddit_text += chunk['clean_comment'].tolist()
        cosmos_reddit_sentiment += chunk['category'].tolist()
        i += 1
    else:
        break

In [7]:
cosmos_twitter_df = pd.DataFrame(data={'text': cosmos_twitter_text, 'sentiment': cosmos_twitter_sentiment}).dropna()
cosmos_reddit_df = pd.DataFrame(data={'text': cosmos_reddit_text, 'sentiment': cosmos_reddit_sentiment}).dropna()
# cosmos_twitter_df.shape #expected (99999, 2)
# cosmos_reddit_df.shape #expected (37149, 2)

In [8]:
cosmos_twitter_df

Unnamed: 0,text,sentiment
0,when modi promised “minimum government maximum...,-1
1,talk all the nonsense and continue all the dra...,0
2,what did just say vote for modi welcome bjp t...,1
3,asking his supporters prefix chowkidar their n...,1
4,answer who among these the most powerful world...,1
...,...,...
99995,only modi theres question compare freebies wit...,0
99996,modi disgrace always people who have expectati...,1
99997,why want vote for modi,0
99998,why didnt people say this intolerance all beca...,0


### (cosmos98) Twitter dataset ###

In [14]:
en_cosmos_twitter_text, en_cosmos_twitter_sentiment, en_cosmos_twitter_lang = check_en(cosmos_twitter_df['text'], cosmos_twitter_df['sentiment'])

Finished English check


In [15]:
en_cosmos_twitter_df = pd.DataFrame(data={'text': en_cosmos_twitter_text, 'sentiment':en_cosmos_twitter_sentiment, 'english':en_cosmos_twitter_lang})
en_cosmos_twitter_df = en_cosmos_twitter_df.dropna()
en_cosmos_twitter_df = en_cosmos_twitter_df.drop(columns=['english'])
print(en_cosmos_twitter_df.shape) #expected roughly (91952, 2); langdetect is non-deterministic, so the count varies between runs

(91822, 2)


In [16]:
en_cosmos_twitter_df['sentiment'] = en_cosmos_twitter_df['sentiment'].apply(sentiment_to_string)
# en_cosmos_twitter_df

Unnamed: 0,text,sentiment
0,when modi promised “minimum government maximum...,negative
1,talk all the nonsense and continue all the dra...,neutral
2,what did just say vote for modi welcome bjp t...,positive
3,asking his supporters prefix chowkidar their n...,positive
4,answer who among these the most powerful world...,positive
...,...,...
98995,india cant survive another term modi,neutral
98996,modi hands down indians are too influenced bol...,negative
98997,rajdeep known congress supporters from day one...,neutral
98998,its bcoz they hate modi thats,negative


### (cosmos98) Reddit dataset ###

In [17]:
en_cosmos_reddit_text, en_cosmos_reddit_sentiment, en_cosmos_reddit_lang = check_en(cosmos_reddit_df['text'], cosmos_reddit_df['sentiment'])

Finished English check


In [18]:
en_cosmos_reddit_df = pd.DataFrame(data={'text': en_cosmos_reddit_text, 'sentiment': en_cosmos_reddit_sentiment, 'english':en_cosmos_reddit_lang})
en_cosmos_reddit_df = en_cosmos_reddit_df.dropna()
en_cosmos_reddit_df = en_cosmos_reddit_df.drop(columns=['english'])
# print(en_cosmos_reddit_df.shape) #expected (31669, 2)

In [19]:
en_cosmos_reddit_df['sentiment'] = en_cosmos_reddit_df['sentiment'].apply(sentiment_to_string)
# en_cosmos_reddit_df

### NOT USABLE -- (kazanova) Sentiment140 dataset with 1.6 million tweets -- NOT USABLE ###
The dataset originally has 1.6 million tweets; it is limited here to the first 100K, with NaN values dropped.

Unlike the previous datasets, sentiment is recorded as 0=negative, 2=neutral, 4=positive.
However, in the first 100K rows read here, the target column that is supposed to record this is entirely 0, which is unlikely for a set of 1.6 million tweets. This may be an artifact of the file being ordered by label rather than a property of the full dataset, but as sampled, the sentiment from this dataset cannot be used with the others. It might be saved for a separate purpose.
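A quick way to see which classes a sample actually contains is to map the 0/2/4 coding onto label strings and count them. This is a minimal sketch on made-up target values, not on the real file; `SENTIMENT140_LABELS` and `label_counts` are hypothetical names.

```python
from collections import Counter

# Label coding as described above for this dataset.
SENTIMENT140_LABELS = {0: "negative", 2: "neutral", 4: "positive"}


def label_counts(targets):
    """Count how often each mapped label occurs in a sample of target values."""
    return Counter(SENTIMENT140_LABELS.get(t, "unknown") for t in targets)


# Made-up sample in which every row carries the same target value.
counts = label_counts([0, 0, 0, 0])
```

A result with a single key, as here, means the sample cannot be used to train or score a three-class model, whatever the full file contains.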

In [20]:
# kazanova_text = []
# kazanova_sentiment = []
# #the file has no header row, so without header=None pandas uses the first observation as the columns
# col_text = 5
# col_sentiment = 0
# i = 0
# for chunk in pd.read_csv('118Adatasets/kazanova_sentiment140.csv', header=None, chunksize=row_count):
#     if i < max_obv:
#         kazanova_text += chunk.iloc[:,col_text].tolist()
#         kazanova_sentiment += chunk.iloc[:,col_sentiment].tolist()
#         i += 1
#     else:
#         break

In [21]:
# kazanova_df = pd.DataFrame(data={'text':kazanova_text, 'sentiment':kazanova_sentiment})
# kazanova_df['sentiment'].unique()

### NOT USABLE -- (crowdflower) Twitter US Airline Sentiment -- NOT USABLE ###
The dataset has a Tweet id for each observation, but not the actual text itself. Sentiment is recorded as the strings 'positive', 'negative' and 'neutral', which can be converted to the numerical values 1, -1, and 0 respectively. Since no text is readily available in the file, the numeric conversion is not applied to this dataset here. Unlike the kazanova dataset, we likely cannot use this for another purpose without taking significant time to extract the text from the Tweet id for each observation.
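Should the text become available, the conversion itself is small. A minimal sketch, assuming the labels arrive as plain strings; `AIRLINE_TO_NUMERIC` and `to_numeric` are hypothetical names and the input list is made up.

```python
# Mapping described above: positive -> 1, negative -> -1, neutral -> 0.
AIRLINE_TO_NUMERIC = {"positive": 1, "neutral": 0, "negative": -1}


def to_numeric(labels):
    """Convert string labels to 1, 0 or -1; unrecognized labels become None."""
    return [AIRLINE_TO_NUMERIC.get(str(label).strip().lower()) for label in labels]


numeric = to_numeric(["positive", "negative", "neutral"])
```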

### (tirendazacademy) FIFA World Cup 2022 Tweets ###
Has the Tweet text body and sentiment as the strings 'positive', 'negative' and 'neutral', which are left unchanged.

In [38]:
fifa_text = []
fifa_sentiment = []
i = 0
for chunk in pd.read_csv('fifa_world_cup_2022_tweets.csv', chunksize=row_count):
    if i < max_obv:
        fifa_text += chunk['Tweet'].tolist()
        fifa_sentiment += chunk['Sentiment'].tolist()
        i += 1
    else:
        break

In [39]:
fifa_df = pd.DataFrame(data={'text':fifa_text,'sentiment':fifa_sentiment}).dropna()
# fifa_df

In [45]:
en_fifa_text, en_fifa_sentiment, en_fifa_lang = check_en(fifa_df['text'], fifa_df['sentiment'])

Finished English check


In [46]:
en_fifa_df = pd.DataFrame(data={'text': en_fifa_text, 'sentiment': en_fifa_sentiment, 'english': en_fifa_lang})
en_fifa_df = en_fifa_df.dropna()
en_fifa_df = en_fifa_df.drop(columns=['english'])
# print(en_fifa_df.shape) #expected (21798, 2)

Below are the cleaned DataFrames. Each has only the columns 'text' (unchanged) and 'sentiment', whose values are "positive", "negative" or "neutral". At most 100K rows are read per dataset. Rows with NaN values and text bodies not detected as English are then dropped, and the English check truncates each dataset to a multiple of 1,000 rows, which keeps the data more manageable.

In [26]:
# en_climate_df #shape (97759, 2)
# en_cosmos_reddit_df #shape (31669, 2)
# en_cosmos_twitter_df #shape (91952, 2)
# en_fifa_df #shape (21798, 2)

In [54]:
en_climate_df = pd.read_csv('en_climate_df.csv')
en_climate_df

Unnamed: 0.1,Unnamed: 0,text,sentiment
0,0,Yeah but what the above commenter is saying is...,positive
1,1,Any comparison of efficiency between solar and...,negative
2,2,I'm honestly waiting for climate change and th...,negative
3,3,Not just Sacramento. It's actually happening a...,neutral
4,4,I think climate change tends to get some peopl...,positive
...,...,...,...
97744,97995,It’s almost like climate change is real. Huh.,positive
97745,97996,I'm not sure I agree. More Americans are consi...,positive
97746,97997,If 40 billion could fix climate change why did...,neutral
97747,97998,"There are a lot of answers to climate change, ...",neutral


### Added from Deepansha ###

In [55]:
#changed variable names from Deepansha's original
X = en_climate_df['text']
Y = en_climate_df['sentiment']

In [56]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=15, train_size=0.6)

In [57]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier

countVectorizer = CountVectorizer()
X_train = countVectorizer.fit_transform(X_train)

In [30]:
# countVectorizer.vocabulary_.get('climate')

In [58]:
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train)
X_train = tf_transformer.transform(X_train)

In [59]:
clf = KNeighborsClassifier(3).fit(X_train, Y_train)

In [60]:
tester = ['yayyy!', 'terrible', "she walked to the right", "woohoo", "I don't feel good", "sad", "feel kinda blue"]
X_new_counts = countVectorizer.transform(tester)
X_new_tfidf = tf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
predicted

array(['positive', 'negative', 'positive', 'positive', 'positive',
       'negative', 'neutral'], dtype=object)
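The cells above fit the vectorizer, the term-frequency transform and the classifier one after another, and the held-out `X_test` / `Y_test` split is never scored. One way to keep the three steps together and score a held-out split is a scikit-learn pipeline. This is a sketch under stated assumptions: the toy texts and labels are made up, and `model` is a hypothetical name.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Made-up toy data, only to show the mechanics.
train_texts = ["great day", "love this", "so happy",
               "awful day", "hate this", "so sad"]
train_labels = ["positive"] * 3 + ["negative"] * 3
test_texts = ["happy day", "sad day"]
test_labels = ["positive", "negative"]

# Same three steps as the cells above, fitted on the training split only.
model = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(use_idf=False),
    KNeighborsClassifier(n_neighbors=3),
)
model.fit(train_texts, train_labels)

# Raw strings go straight in; the pipeline applies the fitted transforms.
predictions = model.predict(test_texts)
held_out_accuracy = model.score(test_texts, test_labels)
```

Because the pipeline carries its own fitted vocabulary, retraining on another dataset cannot silently pair a new classifier with an old vectorizer.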

### Added from Raina ###

### Test Reddit Climate Change Dataset Trained Model Using Other Datasets ###

In [91]:
# Test with cosmos reddit sentiment data
cosmos_reddit_test = pd.read_csv('en_cosmos_reddit_df.csv')
cosmos_reddit_test_data = cosmos_reddit_test['text']
cosmos_reddit_test_data

0         family mormon have never tried explain them t...
1        buddhism has very much lot compatible with chr...
2        seriously don say thing first all they won get...
3        what you have learned yours and only yours wha...
4        for your own benefit you may want read living ...
                               ...                        
31643            coincidentally that how randia works too 
31644               here screen cap his live telecast jpg 
31645              shit forgot today date take upvote sir 
31646     fell for this good one also shows overtly opt...
31647            thought now going ban 2000 rupee note lol
Name: text, Length: 31648, dtype: object

In [92]:
X_new_counts = countVectorizer.transform(cosmos_reddit_test_data)
X_new_tfidf = tf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
predicted

array(['positive', 'negative', 'negative', ..., 'negative', 'positive',
       'negative'], dtype=object)

In [93]:
cosmos_reddit_test_sentiment = cosmos_reddit_test['sentiment']

correct = 0
passed = 0
for i in range(len(predicted)):
    try:
        if cosmos_reddit_test_sentiment[i] == predicted[i]:
            correct += 1
    except KeyError:  # no label at this index
        passed += 1
print("correct: ", correct)
print("passed: ", passed)
print("accuracy: ", correct/(len(predicted)-passed))

correct:  12522
passed:  0
accuracy:  0.39566481294236605
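An accuracy figure is easier to interpret next to a baseline. With three labels, uniform random guessing is right about a third of the time in expectation, and always predicting the most common label gives another reference point. The sketch below computes both accuracy and that majority baseline on made-up labels; the function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common true label."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Made-up labels and predictions.
y_true = ["positive", "neutral", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "neutral"]

model_accuracy = accuracy(y_true, y_pred)
baseline = majority_baseline(y_true)
```

If the model's accuracy does not clearly exceed the majority baseline on a test set, the model has learned little that transfers to that dataset.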


In [97]:
# Test with cosmos twitter sentiment data
cosmos_twitter_test = pd.read_csv('en_cosmos_twitter_df.csv')
cosmos_twitter_test_data = cosmos_twitter_test['text']
cosmos_twitter_test_data

0        when modi promised “minimum government maximum...
1        talk all the nonsense and continue all the dra...
2        what did just say vote for modi  welcome bjp t...
3        asking his supporters prefix chowkidar their n...
4        answer who among these the most powerful world...
                               ...                        
91953                 india cant survive another term modi
91954    modi hands down indians are too influenced bol...
91955    rajdeep known congress supporters from day one...
91956                       its bcoz they hate modi thats 
91957    narendra modi begin campaign trail address pub...
Name: text, Length: 91958, dtype: object

In [98]:
X_new_counts = countVectorizer.transform(cosmos_twitter_test_data)
X_new_tfidf = tf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
predicted

array(['negative', 'negative', 'negative', ..., 'positive', 'negative',
       'neutral'], dtype=object)

In [99]:
cosmos_twitter_test_sentiment = cosmos_twitter_test['sentiment']

correct = 0
passed = 0
for i in range(len(predicted)):
    try:
        if cosmos_twitter_test_sentiment[i] == predicted[i]:
            correct += 1
    except KeyError:  # no label at this index
        passed += 1
print("correct: ", correct)
print("passed: ", passed)
print("accuracy: ", correct/(len(predicted)-passed))

correct:  32575
passed:  0
accuracy:  0.3542378042149677


In [100]:
# Test with fifa twitter data
fifa_test = pd.read_csv('en_fifa_df.csv')
fifa_test_data = fifa_test['text']
fifa_test_data

0        What are we drinking today @TucanTribe \n@MadB...
1        Amazing @CanadaSoccerEN  #WorldCup2022 launch ...
2        Worth reading while watching #WorldCup2022 htt...
3        Golden Maknae shinning bright\n\nhttps://t.co/...
4        If the BBC cares so much about human rights, h...
                               ...                        
21793    Leave #FIFA alone! Let the #soccer prevail and...
21794    Three stars on this logo after WC\n#WorldCup20...
21795    What is really sickening is the west trying to...
21796    Messi’s last World Cup! I am going with Argent...
21797    Who will win the World Cup 2022?\n🏆⚽️\n\n@FIFA...
Name: text, Length: 21798, dtype: object

In [101]:
X_new_counts = countVectorizer.transform(fifa_test_data)
X_new_tfidf = tf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
predicted

array(['negative', 'positive', 'positive', ..., 'positive', 'positive',
       'positive'], dtype=object)

In [102]:
fifa_test_sentiment = fifa_test['sentiment']

correct = 0
passed = 0
for i in range(len(predicted)):
    try:
        if fifa_test_sentiment[i] == predicted[i]:
            correct += 1
    except KeyError:  # no label at this index
        passed += 1
print("correct: ", correct)
print("passed: ", passed)
print("accuracy: ", correct/(len(predicted)-passed))

correct:  7799
passed:  0
accuracy:  0.3577851179007248


### Train on Different Datasets ###

##### Twitter Sentiment Analysis #####

In [9]:
cosmos_twitter_df

Unnamed: 0,text,sentiment
0,when modi promised “minimum government maximum...,-1
1,talk all the nonsense and continue all the dra...,0
2,what did just say vote for modi welcome bjp t...,1
3,asking his supporters prefix chowkidar their n...,1
4,answer who among these the most powerful world...,1
...,...,...
99995,only modi theres question compare freebies wit...,0
99996,modi disgrace always people who have expectati...,1
99997,why want vote for modi,0
99998,why didnt people say this intolerance all beca...,0


In [18]:
cosmos_twitter_df["sentiment"] = cosmos_twitter_df["sentiment"].apply(lambda curr: "negative" if float(curr) < 0 else ("positive" if float(curr) > 0 else "neutral"))
cosmos_twitter_df

Unnamed: 0,text,sentiment
0,when modi promised “minimum government maximum...,negative
1,talk all the nonsense and continue all the dra...,neutral
2,what did just say vote for modi welcome bjp t...,positive
3,asking his supporters prefix chowkidar their n...,positive
4,answer who among these the most powerful world...,positive
...,...,...
99995,only modi theres question compare freebies wit...,neutral
99996,modi disgrace always people who have expectati...,positive
99997,why want vote for modi,neutral
99998,why didnt people say this intolerance all beca...,neutral


In [19]:
X = cosmos_twitter_df['text']
Y = cosmos_twitter_df['sentiment']

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=15, train_size=0.6)

In [21]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier

countVectorizer = CountVectorizer()
X_train = countVectorizer.fit_transform(X_train)

In [22]:
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train)
X_train = tf_transformer.transform(X_train)

In [23]:
clf = KNeighborsClassifier(3).fit(X_train, Y_train)

In [24]:
tester = ['yayyy!', 'terrible', "she walked to the right", "woohoo", "I don't feel good", "sad", "feel kinda blue"]
X_new_counts = countVectorizer.transform(tester)
X_new_tfidf = tf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
predicted

array(['neutral', 'neutral', 'neutral', 'neutral', 'positive', 'negative',
       'neutral'], dtype=object)

##### Reddit Sentiment Analysis #####

In [None]:
cosmos_reddit_df["sentiment"] = cosmos_reddit_df["sentiment"].apply(lambda curr: "negative" if float(curr) < 0 else ("positive" if float(curr) > 0 else "neutral"))

In [28]:
cosmos_reddit_df

Unnamed: 0,text,sentiment
0,family mormon have never tried explain them t...,positive
1,buddhism has very much lot compatible with chr...,positive
2,seriously don say thing first all they won get...,negative
3,what you have learned yours and only yours wha...,neutral
4,for your own benefit you may want read living ...,positive
...,...,...
37244,jesus,neutral
37245,kya bhai pure saal chutiya banaya modi aur jab...,positive
37246,downvote karna tha par upvote hogaya,neutral
37247,haha nice,positive


In [36]:
X = cosmos_reddit_df['text']
Y = cosmos_reddit_df['sentiment']

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=15, train_size=0.6)

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier

countVectorizer = CountVectorizer()
X_train = countVectorizer.fit_transform(X_train)

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train)
X_train = tf_transformer.transform(X_train)

clf = KNeighborsClassifier(3).fit(X_train, Y_train)

In [37]:
tester = ['yayyy!', 'terrible', "she walked to the right", "woohoo", "I don't feel good", "sad", "feel kinda blue"]
X_new_counts = countVectorizer.transform(tester)
X_new_tfidf = tf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
predicted

array(['neutral', 'negative', 'positive', 'neutral', 'positive',
       'negative', 'neutral'], dtype=object)

##### FIFA Dataset #####

In [47]:
en_fifa_df

Unnamed: 0,text,sentiment
0,What are we drinking today @TucanTribe \n@MadB...,neutral
1,Amazing @CanadaSoccerEN #WorldCup2022 launch ...,positive
2,Worth reading while watching #WorldCup2022 htt...,positive
4,"If the BBC cares so much about human rights, h...",negative
5,"And like, will the mexican fans be able to scr...",negative
...,...,...
21995,Leave #FIFA alone! Let the #soccer prevail and...,positive
21996,Three stars on this logo after WC\n#WorldCup20...,positive
21997,What is really sickening is the west trying to...,negative
21998,Messi’s last World Cup! I am going with Argent...,positive


In [48]:
X = en_fifa_df['text']
Y = en_fifa_df['sentiment']

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=15, train_size=0.6)

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier

countVectorizer = CountVectorizer()
X_train = countVectorizer.fit_transform(X_train)

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train)
X_train = tf_transformer.transform(X_train)

clf = KNeighborsClassifier(3).fit(X_train, Y_train)

In [49]:
tester = ['yayyy!', 'terrible', "she walked to the right", "woohoo", "I don't feel good", "sad", "feel kinda blue"]
X_new_counts = countVectorizer.transform(tester)
X_new_tfidf = tf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
predicted

array(['negative', 'negative', 'neutral', 'positive', 'positive',
       'negative', 'negative'], dtype=object)