
In [None]:
!pip install markovify

In [3]:
# Imports
import numpy as np
import pandas as pd
import nltk
from time import time
import re
import spacy
import markovify
from sqlalchemy import create_engine
import warnings
warnings.filterwarnings("ignore")


In this notebook we learn from Twitter post data to generate text using Markov chains. First I load the data; the loading code is hidden because the data comes from a private server.

In [6]:
twitter_df.head(10)

Unnamed: 0,index,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
5,5,570300767074181121,negative,1.0,Can't Tell,0.6842,Virgin America,,jnardino,,0,@VirginAmerica seriously would pay $30 a fligh...,,2015-02-24 11:14:33 -0800,,Pacific Time (US & Canada)
6,6,570300616901320704,positive,0.6745,,0.0,Virgin America,,cjmcginnis,,0,"@VirginAmerica yes, nearly every time I fly VX...",,2015-02-24 11:13:57 -0800,San Francisco CA,Pacific Time (US & Canada)
7,7,570300248553349120,neutral,0.634,,,Virgin America,,pilot,,0,@VirginAmerica Really missed a prime opportuni...,,2015-02-24 11:12:29 -0800,Los Angeles,Pacific Time (US & Canada)
8,8,570299953286942721,positive,0.6559,,,Virgin America,,dhepburn,,0,"@virginamerica Well, I didn't…but NOW I DO! :-D",,2015-02-24 11:11:19 -0800,San Diego,Pacific Time (US & Canada)
9,9,570295459631263746,positive,1.0,,,Virgin America,,YupitsTate,,0,"@VirginAmerica it was amazing, and arrived an ...",,2015-02-24 10:53:27 -0800,Los Angeles,Eastern Time (US & Canada)


The data set has many columns, but we will only use two of them: the text of the tweet and the airline sentiment. We will generate text separately from tweets classified as negative and as positive to see how the generated text differs. But first, we want to clean the data. Let's check the columns, then make a simpler data frame with only the data that we will be using.


In [7]:
twitter_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   index                         14640 non-null  int64  
 1   tweet_id                      14640 non-null  int64  
 2   airline_sentiment             14640 non-null  object 
 3   airline_sentiment_confidence  14640 non-null  float64
 4   negativereason                9178 non-null   object 
 5   negativereason_confidence     10522 non-null  float64
 6   airline                       14640 non-null  object 
 7   airline_sentiment_gold        40 non-null     object 
 8   name                          14640 non-null  object 
 9   negativereason_gold           32 non-null     object 
 10  retweet_count                 14640 non-null  int64  
 11  text                          14640 non-null  object 
 12  tweet_coord                   1019 non-null   object 
 13  t

In [8]:
twitter = twitter_df.drop(columns=['tweet_id', 'airline_sentiment_confidence', 'negativereason', 'negativereason_confidence', 'airline', 'airline_sentiment_gold',
                                   'name', 'negativereason_gold', 'retweet_count', 'tweet_coord', 'tweet_created', 'tweet_location', 'user_timezone'])
twitter.head()

Unnamed: 0,index,airline_sentiment,text
0,0,neutral,@VirginAmerica What @dhepburn said.
1,1,positive,@VirginAmerica plus you've added commercials t...
2,2,neutral,@VirginAmerica I didn't today... Must mean I n...
3,3,negative,@VirginAmerica it's really aggressive to blast...
4,4,negative,@VirginAmerica and it's a really big bad thing...


Now we have only the columns that we will be working with. Next, I want to clean the data frame, starting by making sure there are no missing values.

In [9]:
twitter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   index              14640 non-null  int64 
 1   airline_sentiment  14640 non-null  object
 2   text               14640 non-null  object
dtypes: int64(1), object(2)
memory usage: 343.2+ KB


All 14,640 rows have a sentiment label and text, so there are no missing values. Before cleaning the text, let's separate the data frame into the negative and positive tweets.

In [10]:
twitter['airline_sentiment'].value_counts()

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64

We see that the negative tweets far outnumber the positive tweets. This should not be surprising if, like me, you believe that people are more likely to complain about a company than to comment on good service.
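
To put a number on the imbalance, the counts printed above can be turned into percentages. This is a small sketch that hard-codes the three counts from the `value_counts()` output rather than reading them from `twitter`; negative tweets come out at roughly 63% of the data set.

```python
# Counts copied from the value_counts() output above.
counts = {"negative": 9178, "neutral": 3099, "positive": 2363}
total = sum(counts.values())

# Share of each sentiment class, as a percentage rounded to one decimal.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label:<9}{share:>5}%")
```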

In [11]:
negative_twitter = twitter[twitter['airline_sentiment'] == 'negative']
positive_twitter = twitter[twitter['airline_sentiment'] == 'positive']

Now we can proceed to cleaning the text.
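
The cells that follow pass the raw tweet text straight to spaCy, so HTML entities such as `&amp;` and links can survive into the generated sentences. If you would rather strip them first, one possible approach is sketched below. `clean_tweet` and `URL_PATTERN` are hypothetical names that are not part of the original notebook, and the sample tweet is made up for illustration.

```python
import html
import re

# Hypothetical helper, not part of the original notebook.
URL_PATTERN = re.compile(r"https?://\S+")

def clean_tweet(text: str) -> str:
    """Decode HTML entities, drop URLs, and collapse whitespace."""
    text = html.unescape(text)        # "&amp;" -> "&", "&lt;" -> "<"
    text = URL_PATTERN.sub("", text)  # remove links such as http://t.co/...
    return " ".join(text.split())     # collapse runs of spaces and newlines

# Made-up example tweet, for illustration only.
print(clean_tweet("@VirginAmerica I &lt;3 pretty graphics.\nso much better http://t.co/example"))
```

Applied to the data frames above, this could look like `negative_twitter.text.apply(clean_tweet)` before the tweets are joined into one string.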

In [13]:
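# Load the small English model; the parser and NER are not needed here
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# Rule-based sentence splitter (spaCy v2 syntax; in spaCy v3 use nlp.add_pipe('sentencizer'))
nlp.add_pipe(nlp.create_pipe('sentencizer'))
# Raise the character limit so all the tweets fit in a single document
nlp.max_length = 20000000
docn = nlp(" ".join(negative_twitter.text))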

In [14]:
docn[0:100]

@VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse @VirginAmerica and it's a really big bad thing about it @VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.
it's really the only bad thing about flying VA @VirginAmerica SFO-PDX schedule is still MIA. @VirginAmerica  I flew from NYC to SFO last week and couldn't fully sit in my seat due to two large gentleman on either side of me.

In [15]:
# Reuse the pipeline built above for the positive tweets
docp = nlp(" ".join(positive_twitter.text))

In [16]:
docp[0:100]

@VirginAmerica plus you've added commercials to the experience... tacky. @VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :) @virginamerica Well, I didn't…but NOW I DO! :-D @VirginAmerica it was amazing, and arrived an hour early. You're too good to me. @VirginAmerica I &lt;3 pretty graphics. so much better than minimal iconography. :D @VirginAmerica This is such a great deal! Already thinking about my 2nd trip to @Australia &amp; I haven't even gone

Now we can use Markovify to generate sentences, but first we have to group each document into sentences. 

In [31]:
# Group into sentences
# Use the sentences that have more than one character
neg_sents = " ".join([sent.text for sent in docn.sents if len(sent.text) > 1])
neg_sents[0:500]

'@VirginAmerica it\'s really aggressive to blast obnoxious "entertainment" in your guests\' faces &amp; they have little recourse @VirginAmerica and it\'s a really big bad thing about it @VirginAmerica seriously would pay $30 a flight for seats that didn\'t have this playing. \nit\'s really the only bad thing about flying VA @VirginAmerica SFO-PDX schedule is still MIA. @VirginAmerica  I flew from NYC to SFO last week and couldn\'t fully sit in my seat due to two large gentleman on either side of me. HE'

In [33]:
pos_sents = " ".join([sent.text for sent in docp.sents if len(sent.text) > 1])
pos_sents[0:500]

"@VirginAmerica plus you've added commercials to the experience... tacky. @VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :) @virginamerica Well, I didn't…but NOW I DO! :-D @VirginAmerica it was amazing, and arrived an hour early. You're too good to me. @VirginAmerica I &lt;3 pretty graphics. so much better than minimal iconography. :D @VirginAmerica This is such a great deal! Already thinking about my 2nd trip to @Australia &amp; I haven't even gone on my 1st trip ye"

In [20]:
# state_size=3: each next word is chosen based on the previous three words
# Good tweets
positive_generator = markovify.Text(pos_sents, state_size=3)
# Bad tweets
negative_generator = markovify.Text(neg_sents, state_size=3)

Great! Now let's take a look at some example sentences!

In [22]:
# Positive
# Ten randomly generated sentences
# (make_sentence() returns None when it cannot build an acceptable sentence)
for i in range(10):
    print(positive_generator.make_sentence())

An Agent will be with you shortly...^CB @SouthwestAir I love imagine dragons o flipping much, pls #DestinationDragons @SouthwestAir I would love tix to your show!
Pilots did a great job.
@JetBlue Thanks for the response.
Up &amp; away ✈️🌞 @USAirways Thank you @USAirways Got it, thanks!
None
None
@JetBlue - looking forward to watching the Oscars on my flight have directional dislexia.
None
@united thanks for the quick customer service today.
None
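
Several of the lines above are `None`. markovify's `make_sentence()` returns `None` when it fails, within its allotted number of tries, to build a sentence that differs enough from the source text. One way to handle this is to keep sampling until enough real sentences have been collected. The helper below is a sketch: `sample_sentences` is a hypothetical name, and it accepts any zero-argument callable so that it does not depend on markovify itself.

```python
from typing import Callable, List, Optional

def sample_sentences(make_sentence: Callable[[], Optional[str]],
                     n: int, max_attempts: int = 1000) -> List[str]:
    """Call make_sentence until n non-None results are collected.

    Gives up after max_attempts calls, so the result may be shorter than n.
    """
    sentences: List[str] = []
    for _ in range(max_attempts):
        if len(sentences) == n:
            break
        sentence = make_sentence()
        if sentence is not None:
            sentences.append(sentence)
    return sentences
```

With the generators built above, this would be called as `sample_sentences(positive_generator.make_sentence, 10)`. Passing a larger `tries` value to `make_sentence` (for example `make_sentence(tries=100)`) is another way to reduce the number of `None` results.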


In [24]:
# Negative
# Ten randomly generated sentences
for i in range(10):
    print(negative_generator.make_sentence())

Said standby maybe 2days @united your website won't let me select online and there's a ridiculous wait time on 4348 to get a flight from a couple weeks now.
# totalfail #letdown @SouthwestAir you are a monopoly in @flySFO doesn't mean you treat your passengers like shit.
None
None
@SouthwestAir had a very rude gate agent at arrival MKE 2/23.
My claim number is expired because you WON'T find my bag @united I had to pay to sleep in Philly.
@AmericanAir here is the ticket they gave me travel credit but will not disclose what they are...
@united I'm try to upgrade my seats to first class our luggage was open and now missing valuable items that can't be rebooked online?Rep I called quoted me $1k higher than the website.
I'll most likely never book with you guys and your website won't allow me to wait on hold at your main number.
@americanair 30 minutes since landing - flight 1531 from Miami and no one answers at 800 number.


As we can see, the generated text reflects the overall sentiment of the tweets that it was built from. This should not be surprising: a Markov chain essentially looks at the previous few words (three here, because state_size is 3) to pick a word that can come next, so it can only recombine phrases that appear in its source tweets. Now, what if we looked at all the tweets? Let's find out.
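
To make the idea concrete, here is a minimal word-level Markov chain in plain Python. It is a simplified sketch, not markovify's implementation: markovify also tracks sentence beginnings and endings and rejects output that overlaps too much with the source. `build_chain`, `generate`, and the toy corpus are made up for illustration.

```python
import random
from collections import defaultdict

def build_chain(text, state_size=2):
    """Map each run of `state_size` words to the words that followed it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        chain[state].append(words[i + state_size])
    return chain

def generate(chain, max_words=20, seed=None):
    """Random walk over the chain, starting from a randomly chosen state."""
    rng = random.Random(seed)
    state = rng.choice(list(chain))
    output = list(state)
    # Stop at max_words or when the current state was never followed by a word.
    while len(output) < max_words and state in chain:
        next_word = rng.choice(chain[state])
        output.append(next_word)
        state = tuple(output[-len(state):])
    return " ".join(output)

# Toy corpus, for illustration only.
corpus = "the flight was late and the flight was cancelled and the crew was great"
chain = build_chain(corpus, state_size=2)
print(generate(chain, seed=0))
```

Every group of three consecutive words in the output also appears somewhere in the corpus, which is why text generated from negative tweets reads as negative. A larger `state_size` produces more coherent text but copies longer stretches of the source.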

In [26]:
# Reuse the pipeline built above for all tweets
doc = nlp(" ".join(twitter.text))

In [30]:
# Group into sentences
# Use the sentences that have more than one character
sents = " ".join([sent.text for sent in doc.sents if len(sent.text) > 1])
sents[0:500]

'@VirginAmerica What @dhepburn said. @VirginAmerica plus you\'ve added commercials to the experience... tacky. @VirginAmerica I didn\'t today... Must mean I need to take another trip! @VirginAmerica it\'s really aggressive to blast obnoxious "entertainment" in your guests\' faces &amp; they have little recourse @VirginAmerica and it\'s a really big bad thing about it @VirginAmerica seriously would pay $30 a flight for seats that didn\'t have this playing. \nit\'s really the only bad thing about flying VA'

In [28]:
generator = markovify.Text(sents, state_size = 3)

In [29]:
# Ten randomly generated sentences
for i in range(10):
    print(generator.make_sentence())

😄 @SouthwestAir Sort of, but I'm arriving a day and a half because of this!
I'm on hold for over a month to book a rwrd and was told no because i used points!
# UA6136 http://t.co/4JdvK8tCqx @united Do you have any sort of email confirmation that it’s been received.
@SouthwestAir Thank you for allowing me to see an agent to rebook my flight myself and am about to board their 5th plane of the day.
Evry 1st flight I take w/ you is delayed &amp; evry 2nd is on time when its 5 minutes past departure time and the attendants at MKE made my day!
Crew forgetting to load bags. expecting comp...
@AmericanAir Phone just disconnects if you stay on the plane, just as a carry on.
None
@united my flight out of BNA today?
No bad weather in RDU.


Well! As expected, these tweets aren't as negative as those generated from the purely negative tweets. But since negative tweets far outnumber all other tweets, the generated tweets are still pretty negative.