In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import TextBlob

In [2]:
#Load in our DataFrame
df = pd.read_csv('DailyComments.csv')

In [3]:
df

Unnamed: 0,Day of Week,comments
0,Monday,"Hello, how are you?"
1,Tuesday,Today is a good day!
2,Wednesday,It's my birthday so it's a really special day!
3,Thursday,Today is neither a good day or a bad day!
4,Friday,I'm having a bad day.
5,Saturday,There' s nothing special happening today.
6,Sunday,Today is a SUPER good day!


In [4]:
#Bring in our sentiment reader. We will run it on cleaned and uncleaned DataFrame
def senti(x):
    return TextBlob(x).sentiment

df['Uncleaned Sentiment'] = df.comments.apply(senti)

In [5]:
df

Unnamed: 0,Day of Week,comments,Uncleaned Sentiment
0,Monday,"Hello, how are you?","(0.0, 0.0)"
1,Tuesday,Today is a good day!,"(0.875, 0.6000000000000001)"
2,Wednesday,It's my birthday so it's a really special day!,"(0.44642857142857145, 0.5714285714285714)"
3,Thursday,Today is neither a good day or a bad day!,"(-0.08749999999999991, 0.6333333333333333)"
4,Friday,I'm having a bad day.,"(-0.6999999999999998, 0.6666666666666666)"
5,Saturday,There' s nothing special happening today.,"(0.35714285714285715, 0.5714285714285714)"
6,Sunday,Today is a SUPER good day!,"(0.6041666666666666, 0.6333333333333333)"


In [6]:
#Make a new column so we can clean our comments
df['Cleaned Comments'] = df.comments.astype(str)

In [7]:
#This applies lower() to all the comments
df['Cleaned Comments'] = df['Cleaned Comments'].apply(lambda x: " ".join(x.lower() for x in x.split()))

In [8]:
#This removes all punctuation from our comments
df['Cleaned Comments'] = df['Cleaned Comments'].str.replace(r'[^\w\s]', '', regex=True)
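
To see what the pattern `[^\w\s]` actually removes, here is a minimal sketch using Python's built-in `re` module on two of the comments above. The helper name `strip_punctuation` is made up for this illustration and is not part of the notebook.

```python
import re

def strip_punctuation(text):
    """Lower-case the text, then drop every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower())

# Apostrophes count as punctuation, so "I'm" collapses to "im".
print(strip_punctuation("I'm having a bad day."))       # im having a bad day
print(strip_punctuation("Today is a SUPER good day!"))  # today is a super good day
```

The raw-string prefix (`r"..."`) keeps Python from interpreting `\w` and `\s` as string escapes. In pandas, the same pattern only acts as a regular expression when `regex=True` is passed to `.str.replace` on recent versions.
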

In [9]:
#This removes stop words that will be useless in our comments
stop = set(stopwords.words('english'))
df['Cleaned Comments'] = df['Cleaned Comments'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
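
One thing to watch with stop-word removal before sentiment analysis is that stop lists can contain negations. The sketch below uses a tiny hypothetical stop list (`MINI_STOP`, invented for this example) rather than NLTK's full English list, so it is only an illustration of the risk. Whether a given word is in the real list can be checked with `'not' in stop`.

```python
# Hypothetical mini stop list for illustration only; the notebook uses NLTK's English list.
MINI_STOP = {"is", "a", "the", "not"}

def remove_stop_words(text, stop_words):
    """Keep only the whitespace-separated words that are not in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

original = "today is not a good day"
cleaned = remove_stop_words(original, MINI_STOP)

# The negation is gone, so the cleaned text now reads as a positive comment.
print(cleaned)  # today good day
```

If the negation disappears, a lexicon-based analyzer only sees "good" and would likely score the comment as positive.
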

In [10]:
#This stems all of our words. It has little effect on a data set this small; on a bigger one it merges word variants and shrinks the vocabulary
stemmer = PorterStemmer()
df['Cleaned Comments'] = df['Cleaned Comments'].apply(lambda x: " ".join([stemmer.stem(word) for word in x.split()]))
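
Stems are not always dictionary words. The sketch below is a simplified illustration of a single Porter-style rule (a final "y" becomes "i" after a consonant). It is written to match the output this notebook produces and is an assumption about the rule, not NLTK's actual code; `y_to_i` is a made-up name.

```python
VOWELS = set("aeiou")

def y_to_i(word):
    """Replace a final 'y' with 'i' when it follows a consonant and the stem has more than one letter."""
    stem = word[:-1]
    if word.endswith("y") and len(stem) > 1 and stem[-1] not in VOWELS:
        return stem + "i"
    return word

for word in ["really", "day", "birthday"]:
    print(word, "->", y_to_i(word))
```

Under this rule "really" becomes "realli" while "day" and "birthday" are left alone, because their final "y" follows a vowel.
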

In [11]:
#Now we will print the DataFrame. We have our original comments, the sentiment we got from running the full comment
#And now we have our comments cut down so we can run the sentiment again
df

Unnamed: 0,Day of Week,comments,Uncleaned Sentiment,Cleaned Comments
0,Monday,"Hello, how are you?","(0.0, 0.0)",hello
1,Tuesday,Today is a good day!,"(0.875, 0.6000000000000001)",today good day
2,Wednesday,It's my birthday so it's a really special day!,"(0.44642857142857145, 0.5714285714285714)",birthday realli special day
3,Thursday,Today is neither a good day or a bad day!,"(-0.08749999999999991, 0.6333333333333333)",today neither good day bad day
4,Friday,I'm having a bad day.,"(-0.6999999999999998, 0.6666666666666666)",im bad day
5,Saturday,There' s nothing special happening today.,"(0.35714285714285715, 0.5714285714285714)",noth special happen today
6,Sunday,Today is a SUPER good day!,"(0.6041666666666666, 0.6333333333333333)",today super good day


In [12]:
#This finds the sentiment based on the cleaned comments instead of full comments
df['Cleaned Sentiment'] = df['Cleaned Comments'].apply(senti)

In [13]:
df

Unnamed: 0,Day of Week,comments,Uncleaned Sentiment,Cleaned Comments,Cleaned Sentiment
0,Monday,"Hello, how are you?","(0.0, 0.0)",hello,"(0.0, 0.0)"
1,Tuesday,Today is a good day!,"(0.875, 0.6000000000000001)",today good day,"(0.7, 0.6000000000000001)"
2,Wednesday,It's my birthday so it's a really special day!,"(0.44642857142857145, 0.5714285714285714)",birthday realli special day,"(0.35714285714285715, 0.5714285714285714)"
3,Thursday,Today is neither a good day or a bad day!,"(-0.08749999999999991, 0.6333333333333333)",today neither good day bad day,"(5.551115123125783e-17, 0.6333333333333333)"
4,Friday,I'm having a bad day.,"(-0.6999999999999998, 0.6666666666666666)",im bad day,"(-0.6999999999999998, 0.6666666666666666)"
5,Saturday,There' s nothing special happening today.,"(0.35714285714285715, 0.5714285714285714)",noth special happen today,"(0.35714285714285715, 0.5714285714285714)"
6,Sunday,Today is a SUPER good day!,"(0.6041666666666666, 0.6333333333333333)",today super good day,"(0.5166666666666666, 0.6333333333333333)"


**Explanation and Analysis:**  
  For my sentiment analysis I used TextBlob, whose sentiment property can be applied directly to each comment. With such a small dataset it is easy to check that it is working and to spot where it struggles. I ran the analysis over both the original comments and the cleaned comments because I wanted to see whether the cleaning steps have a big effect and why we do them. For something like a word counter the cleaning makes sense, but for sentiment it looks like a lot of work that the analyzer doesn't need.

Let's look at the numbers that TextBlob(x).sentiment gives us. The first number is the polarity, which ranges from -1 to 1: positive numbers indicate positive sentiment, negative numbers indicate negative sentiment, and numbers close to 0 are neutral. Reading through our comments, I would expect Monday, Thursday and Saturday to be close to 0, Friday to be negative, and the other days to be fairly positive. If I had to rank the days manually from happiest to saddest, I would say Sunday, Wednesday, Tuesday, Thursday, Saturday, Monday, Friday. The second number is the subjectivity score, which ranges from 0 (very objective) to 1 (very subjective). It measures how much of the text reads as opinion rather than fact, not how uncertain the polarity is.

Most of the scores seem about right: Monday, Tuesday, Thursday, Friday and Sunday all match my reading. Wednesday seems like it should be higher, and Saturday is rated quite positive (0.36) even though its tone is close to neutral. Wednesday (0.45) and Saturday (0.36) end up close together, but they don't feel close when I read them.  
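
The comparison between the manual ranking and the TextBlob scores can be made concrete with a short sketch. The polarity values are copied from the Uncleaned Sentiment column above; the variable names are only for this illustration.

```python
# Polarity values copied from the "Uncleaned Sentiment" column of the notebook output.
polarity = {
    "Monday": 0.0,
    "Tuesday": 0.875,
    "Wednesday": 0.44642857142857145,
    "Thursday": -0.08749999999999991,
    "Friday": -0.6999999999999998,
    "Saturday": 0.35714285714285715,
    "Sunday": 0.6041666666666666,
}

# The happiest-to-saddest ranking given in the explanation.
manual_ranking = ["Sunday", "Wednesday", "Tuesday", "Thursday", "Saturday", "Monday", "Friday"]

# Rank the days by polarity, highest first.
textblob_ranking = sorted(polarity, key=polarity.get, reverse=True)

# Days that land in the same position in both rankings.
agreements = [mine for mine, scored in zip(manual_ranking, textblob_ranking) if mine == scored]

print(textblob_ranking)
print(agreements)
```

By score, the order is Tuesday, Sunday, Wednesday, Saturday, Monday, Thursday, Friday, so only Friday sits in the same position as in the manual ranking, although the top three days are the same in both lists.
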
When we clean up the comments, a few scores change. The positive comments that ended in an exclamation point (Tuesday, Wednesday and Sunday) all score lower once punctuation is removed, which suggests the analyzer treats an exclamation point as a sign of stronger sentiment. The stemming step also changed "really" to "realli". That is normal for the Porter stemmer, which produces stems that are not always dictionary words, but a sentiment lexicon built from real words is unlikely to recognize "realli". So do we need to lower-case, remove punctuation, remove stop words and stem before running sentiment? For this analyzer it doesn't seem so, and some of the steps appear to throw away information it uses. Cleaning may still be worthwhile for other tasks or for a very large dataset, but I'm not sure the extra steps are worth it here compared with running the full comment through. For the steps I walked through, I used: https://data-science-blog.com/en/blog/2018/11/04/sentiment-analysis-using-python/.
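
To put a number on the dip, this sketch divides the uncleaned polarity by the cleaned polarity for the three positive comments that end in an exclamation point. The values are copied from the table above. The ratio alone does not prove what the analyzer does internally, so treat the interpretation as a guess.

```python
# (uncleaned polarity, cleaned polarity) copied from the notebook output.
scores = {
    "Tuesday": (0.875, 0.7),
    "Wednesday": (0.44642857142857145, 0.35714285714285715),
    "Sunday": (0.6041666666666666, 0.5166666666666666),
}

# How many times larger the uncleaned score is than the cleaned one.
ratios = {day: uncleaned / cleaned for day, (uncleaned, cleaned) in scores.items()}

for day, ratio in ratios.items():
    print(f"{day}: {ratio:.3f}")
```

Tuesday and Wednesday both come out at 1.25, which is consistent with a fixed boost being applied in the uncleaned text, while Sunday's ratio is smaller, so something else also changes for that comment.
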

## Extra Credit Section  
Link to the CSV can be found at https://www.kaggle.com/austinreese/trump-tweets

In [14]:
#Load in our Trump tweet Data
df_trump = pd.read_csv('realdonaldtrump.csv')
df_trump

Unnamed: 0,id,link,content,date,retweets,favorites,mentions,hashtags
0,1698308935,https://twitter.com/realDonaldTrump/status/169...,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917,,
1,1701461182,https://twitter.com/realDonaldTrump/status/170...,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267,,
2,1737479987,https://twitter.com/realDonaldTrump/status/173...,Donald Trump reads Top Ten Financial Tips on L...,2009-05-08 08:38:08,13,19,,
3,1741160716,https://twitter.com/realDonaldTrump/status/174...,New Blog Post: Celebrity Apprentice Finale and...,2009-05-08 15:40:15,11,26,,
4,1773561338,https://twitter.com/realDonaldTrump/status/177...,"""My persona will never be that of a wallflower...",2009-05-12 09:07:28,1375,1945,,
...,...,...,...,...,...,...,...,...
43347,1273405198698975232,https://twitter.com/realDonaldTrump/status/127...,Joe Biden was a TOTAL FAILURE in Government. H...,2020-06-17 19:00:32,23402,116377,,
43348,1273408026968457216,https://twitter.com/realDonaldTrump/status/127...,Will be interviewed on @ seanhannity tonight a...,2020-06-17 19:11:47,11810,56659,@seanhannity,
43349,1273442195161387008,https://twitter.com/realDonaldTrump/status/127...,pic.twitter.com/3lm1spbU8X,2020-06-17 21:27:33,4959,19344,,
43350,1273442469066276864,https://twitter.com/realDonaldTrump/status/127...,pic.twitter.com/vpCE5MadUz,2020-06-17 21:28:38,4627,17022,,


In [15]:
df_trump = df_trump.drop(['id', 'link', 'date', 'retweets', 'favorites', 'mentions', 'hashtags'], axis=1)

In [16]:
#We just want the polarity this time
def polar(x):
    return TextBlob(x).sentiment.polarity

In [17]:
#Add the sentiment column
df_trump['Sentiment'] = df_trump.content.apply(polar)

In [18]:
df_trump

Unnamed: 0,content,Sentiment
0,Be sure to tune in and watch Donald Trump on L...,0.175000
1,Donald Trump will be appearing on The View tom...,0.170455
2,Donald Trump reads Top Ten Financial Tips on L...,0.121250
3,New Blog Post: Celebrity Apprentice Finale and...,0.136364
4,"""My persona will never be that of a wallflower...",0.000000
...,...,...
43347,Joe Biden was a TOTAL FAILURE in Government. H...,-0.197917
43348,Will be interviewed on @ seanhannity tonight a...,0.500000
43349,pic.twitter.com/3lm1spbU8X,0.000000
43350,pic.twitter.com/vpCE5MadUz,0.000000


**Explanation:**  
This should give the sentiment polarity of every tweet in the dataset, which runs through June 2020. Before going further I want to make sure it is working. To clean this up, which I do below, I want to remove any rows where the polarity is 0. These are usually pictures or videos, which of course carry sentiment, but we can't analyze them this way. I also want to see the mean polarity of his tweets, because judging only from recent times you would assume it is negative. It would probably be better to look at this over time, since he may have sent more positive messages before being elected and the number may have dropped afterwards. It would be fascinating to look into this more, especially over the last couple of months, but that's for a larger project. Let me look at a few things below just to see.
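
Looking at polarity over time could be sketched roughly as follows. This assumes the `date` column is kept rather than dropped, and it uses only the handful of rows visible in the printed output above, so the numbers say nothing about the real trend. The names `sample`, `nonzero` and `yearly_mean` are made up for the illustration.

```python
import pandas as pd

# Rows copied from the notebook's printed output (dates from cell 14, polarity from cell 18).
sample = pd.DataFrame(
    {
        "date": [
            "2009-05-04 13:54:25",
            "2009-05-04 20:00:10",
            "2009-05-08 08:38:08",
            "2009-05-08 15:40:15",
            "2009-05-12 09:07:28",
            "2020-06-17 19:00:32",
            "2020-06-17 19:11:47",
            "2020-06-17 21:27:33",
        ],
        "Sentiment": [0.175, 0.170455, 0.121250, 0.136364, 0.0, -0.197917, 0.5, 0.0],
    }
)
sample["date"] = pd.to_datetime(sample["date"])

# Drop the zero-polarity rows, as the notebook does, then average per calendar year.
nonzero = sample[sample["Sentiment"] != 0]
yearly_mean = nonzero.groupby(nonzero["date"].dt.year)["Sentiment"].mean()
print(yearly_mean)
```

On the full dataset the same `groupby` would give one mean per year from 2009 to 2020, which is the comparison described above.
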

In [19]:
#This removes all rows where the sentiment equals 0
df_trump = df_trump[df_trump['Sentiment'] != 0]

In [20]:
df_trump

Unnamed: 0,content,Sentiment
0,Be sure to tune in and watch Donald Trump on L...,0.175000
1,Donald Trump will be appearing on The View tom...,0.170455
2,Donald Trump reads Top Ten Financial Tips on L...,0.121250
3,New Blog Post: Celebrity Apprentice Finale and...,0.136364
6,Listen to an interview with Donald Trump discu...,0.136364
...,...,...
43342,Congressman @ LeeZeldin is a fighter for New Y...,0.045758
43343,. @ LyndaBennettNC has my Complete & Total End...,0.433333
43345,A GREAT woman. Her son is looking down from he...,0.648889
43347,Joe Biden was a TOTAL FAILURE in Government. H...,-0.197917


In [21]:
#Find the mean
df_trump.Sentiment.mean()

0.2440249768270406

**Explanation:** It is somewhat surprising that his tweets come out positive on average. The past 7 months would probably change this a bit, provided we had access to the tweets that were taken down. But because these tweets go back as far as 2009, I don't think this number is unreasonable. I'll also quickly clean the tweets and run it one last time.

In [22]:
df_trump['Cleaned Comments'] = df_trump.content.astype(str)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_trump['Cleaned Comments'] = df_trump.content.astype(str)


In [23]:
#This applies lower() to all the comments
df_trump['Cleaned Comments'] = df_trump['Cleaned Comments'].apply(lambda x: " ".join(x.lower() for x in x.split()))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_trump['Cleaned Comments'] = df_trump['Cleaned Comments'].apply(lambda x: " ".join(x.lower() for x in x.split()))


In [24]:
#This removes all punctuation from our comments
df_trump['Cleaned Comments'] = df_trump['Cleaned Comments'].str.replace(r'[^\w\s]', '', regex=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_trump['Cleaned Comments'] = df_trump['Cleaned Comments'].str.replace(r'[^\w\s]', '', regex=True)


In [25]:
#This removes stop words that will be useless in our comments
df_trump['Cleaned Comments'] = df_trump['Cleaned Comments'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_trump['Cleaned Comments'] = df_trump['Cleaned Comments'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))


In [26]:
#This stems all of our words
df_trump['Cleaned Comments'] = df_trump['Cleaned Comments'].apply(lambda x: " ".join([stemmer.stem(word) for word in x.split()]))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_trump['Cleaned Comments'] = df_trump['Cleaned Comments'].apply(lambda x: " ".join([stemmer.stem(word) for word in x.split()]))


In [27]:
#This will apply the sentiment analysis
df_trump['Cleaned Sentiment'] = df['Cleaned Comments'].apply(polar)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_trump['Cleaned Sentiment'] = df['Cleaned Comments'].apply(polar)


In [28]:
df_trump

Unnamed: 0,content,Sentiment,Cleaned Comments,Cleaned Sentiment
0,Be sure to tune in and watch Donald Trump on L...,0.175000,sure tune watch donald trump late night david ...,0.000000e+00
1,Donald Trump will be appearing on The View tom...,0.170455,donald trump appear view tomorrow morn discuss...,7.000000e-01
2,Donald Trump reads Top Ten Financial Tips on L...,0.121250,donald trump read top ten financi tip late sho...,3.571429e-01
3,New Blog Post: Celebrity Apprentice Finale and...,0.136364,new blog post celebr apprentic final lesson le...,5.551115e-17
6,Listen to an interview with Donald Trump discu...,0.136364,listen interview donald trump discuss new book...,5.166667e-01
...,...,...,...,...
43342,Congressman @ LeeZeldin is a fighter for New Y...,0.045758,congressman leezeldin fighter new york serv co...,
43343,. @ LyndaBennettNC has my Complete & Total End...,0.433333,lyndabennettnc complet total endors great figh...,
43345,A GREAT woman. Her son is looking down from he...,0.648889,great woman son look heaven proud wonder love ...,
43347,Joe Biden was a TOTAL FAILURE in Government. H...,-0.197917,joe biden total failur govern bungl everyth touch,


In [29]:
df_trump['Cleaned Sentiment'].mean()

0.31476190476190474

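**Explanation:** I got a lot of warnings when setting this up. They are pandas' SettingWithCopyWarning, not errors: df_trump was created in cell 19 by filtering another DataFrame, so pandas warns whenever a new column is assigned to it, even though each assignment still did what I wanted. The NaN values have a different cause. Cell 27 applies polar to df['Cleaned Comments'], the seven daily comments from the first part, instead of df_trump['Cleaned Comments']. pandas aligns the result on the index, so only rows 0, 1, 2, 3 and 6 receive a value and every other row is NaN. That is also why the mean went up to 0.31 and why df_trump['Cleaned Sentiment'].min() is now 0: both are computed from those five daily-comment scores, not from the tweets. Overall, running the sentiment analysis without the text cleaning already gives a good idea of what we want, and I assume that is good enough here.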
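
The index alignment behind the NaN values can be reproduced in a small sketch. The polarity values are the Cleaned Sentiment scores of the seven daily comments, and the index labels are the ones visible in the filtered tweet table; `small_polarity` and `tweets` are stand-in names for this illustration, not the notebook's own objects.

```python
import pandas as pd

# Cleaned polarity of the seven daily comments (index 0..6), copied from the first part.
small_polarity = pd.Series(
    [0.0, 0.7, 0.35714285714285715, 5.551115123125783e-17,
     -0.6999999999999998, 0.35714285714285715, 0.5166666666666666]
)

# A stand-in for the filtered tweet table, using the index labels shown in its output.
tweets = pd.DataFrame(
    {"Sentiment": [0.175, 0.170455, 0.121250, 0.136364, 0.136364,
                   0.045758, 0.433333, 0.648889, -0.197917]},
    index=[0, 1, 2, 3, 6, 43342, 43343, 43345, 43347],
)

# Assigning a Series aligns on index labels: only labels 0, 1, 2, 3 and 6 find a match.
tweets["Cleaned Sentiment"] = small_polarity

print(tweets["Cleaned Sentiment"].isna().sum())  # rows left without a value
print(tweets["Cleaned Sentiment"].mean())        # mean of the five matched values
print(tweets["Cleaned Sentiment"].min())
```

The mean of the five matched values is about 0.3148, the same number the notebook reports. A likely fix is to take an explicit copy when filtering, `df_trump = df_trump[df_trump['Sentiment'] != 0].copy()`, and to compute the new column from the tweet text, `df_trump['Cleaned Sentiment'] = df_trump['Cleaned Comments'].apply(polar)`.
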