## Preprocessing pipeline for social media data

### TEXT NORMALIZATION
Text normalization is one way to overcome or reduce linguistic noise. The task can be approached in two stages: first, identifying orthographic errors in an input text, and second, correcting them. Normalization approaches typically rely on a dictionary of known, correctly spelled terms and classify each term as in-vocabulary or out-of-vocabulary (OOV) with respect to this dictionary. Normalization can be basic or advanced. Basic normalization deals with the errors detected at the POS tagging stage, such as unknown or misspelled words. Advanced normalization is more flexible, taking a lightly supervised automatic approach trained on an external dataset (annotated with short forms and their equivalent long or corrected forms).
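The two stages described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline used in the rest of this notebook: `VOCAB` and `SHORT_FORMS` are tiny hypothetical lookup tables standing in for a real dictionary and an annotated short-form dataset, and the function names are made up for this example.

```python
# Hypothetical, hand-made resources for illustration only.
VOCAB = {"see", "you", "tomorrow", "great", "are"}
SHORT_FORMS = {"u": "you", "2moro": "tomorrow", "gr8": "great", "r": "are"}

def find_oov(tokens, vocab=VOCAB):
    """Stage 1: return the tokens that are not in the dictionary."""
    return [t for t in tokens if t.lower() not in vocab]

def normalize(tokens, vocab=VOCAB, short_forms=SHORT_FORMS):
    """Stage 2: replace known short forms; leave other tokens untouched."""
    out = []
    for t in tokens:
        key = t.lower()
        if key in vocab:
            out.append(key)
        else:
            # fall back to the token itself when no correction is known
            out.append(short_forms.get(key, key))
    return out

tokens = "see u 2moro".split()
print(find_oov(tokens))   # ['u', '2moro']
print(normalize(tokens))  # ['see', 'you', 'tomorrow']
```

Unknown tokens with no entry in the short-form table pass through unchanged, which is the conservative choice when no correction is known.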

In [1]:
import sys

In [2]:
sys.prefix == sys.base_prefix  # False means a virtual environment is active

False

In [17]:
!pip3 install -r requirements.txt

Collecting emoji==1.4.2
  Using cached emoji-1.4.2.tar.gz (184 kB)
Building wheels for collected packages: emoji


  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-1.4.2-py3-none-any.whl size=186452 sha256=9ca569453c30d8212b76392a4a81b2b9b208aea79fa0ca92eacc54f76001590a
  Stored in directory: /Users/dark_prince/Library/Caches/pip/wheels/71/4d/3c/cada364d4ea0026deee7208dee1e61bcebd20aa2ae5dc154ba
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-1.4.2
You should consider upgrading via the '/Users/dark_prince/Desktop/DissoCodes/venv/bin/python -m pip install --upgrade pip' command.


In [18]:
# Imports used for preprocessing
import math
import string

import numpy as np
import pandas as pd
import nltk
import spacy
import demoji
from praw.models import MoreComments
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

# Fetch the emoji codes used by demoji.findall();
# newer demoji releases bundle the data and may emit a deprecation warning here
demoji.download_codes()


In [21]:
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
     |████████████████████████████████| 13.6 MB 1.8 MB/s eta 0:00:01
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.1.0
You should consider upgrading via the '/Users/dark_prince/Desktop/DissoCodes/venv/bin/python -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [76]:
# pyforest imports the usual libraries (pandas, scikit-learn, NumPy, etc.)
#import pyforest
import emoji
import re
import en_core_web_sm

pd.set_option('display.max_colwidth', None)

In [77]:
#unpack pickle file 
df = pd.read_pickle("exampleTweets.pkl")

df

Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,...,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,1420829994545262592,1420829994545262592,1.627587e+12,2021-07-29 20:34:03,+0100,,Giga Texas: 6 months ago vs this week https://t.co/CAgmIJ3wKQ,en,[],[],...,,,,,,[],,,,
1,1420120550756831233,1420120550756831233,1.627418e+12,2021-07-27 21:34:58,+0100,,Music video filmed using Tesla Sentry Mode https://t.co/g7ZYub2NH5,en,[],[],...,,,,,,[],,,,
2,1416793350750056448,1416793350750056448,1.626625e+12,2021-07-18 17:13:52,+0100,,"Solar Roof is designed to withstand heavy storms, hail and even medium-size tree branches https://t.co/IvTROrPC0j",en,[],[],...,,,,,,[],,,,
3,1412548803119435780,1412548803119435780,1.625613e+12,2021-07-07 00:07:33,+0100,,"You can stream Netflix &amp; YouTube, play video games or sing Caraoke in your Tesla 📺🎮🎤",en,[],[],...,,,,,,[],,,,
4,1411104242367143948,1411104240383168512,1.625268e+12,2021-07-03 00:27:23,+0100,,"To activate, tap Software → long press model name → type holidays or ModelXMas",en,[],[],...,,,,,,[],,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,1184205373449879552,1184202345170817024,1.571172e+12,2019-10-15 21:32:17,+0100,,@HarryStoltz1 F,und,[],[],...,,,,,,"[{'screen_name': 'HarryStoltz1', 'name': 'Harry Stoltz🚀', 'id': '727955767567884289'}]",,,,
96,1184202345170817024,1184202345170817024,1.571171e+12,2019-10-15 21:20:15,+0100,,Do you want a sneak peek of our crash lab? Of course you do. It’s one of the reasons why Model 3 is among the safest cars on the road. https://t.co/jQxWizz0ea,en,[],[],...,,,,,,[],,,,
97,1181980583070748672,1181977669556211712,1.570641e+12,2019-10-09 18:11:45,+0100,,@Super_modi_ we're good,en,[],[],...,,,,,,[],,,,
98,1181978026344697856,1181977669556211712,1.570640e+12,2019-10-09 18:01:36,+0100,,@Super_modi_ CAR,und,[],[],...,,,,,,[],,,,


**Corpora Annotation and Utilities**




Text corpora are often annotated with rich metadata, which is extremely useful for gaining insights when the corpora are used for NLP and text analytics. Popular annotations include part-of-speech (POS) tags, word stems, lemmas, and many more. Here are some of the most commonly used methods and techniques for annotating text corpora:


**POS tagging:** This is mainly used to annotate each word with a POS tag indicating the part of speech associated with it. <br>
**Word stems:** A stem for a word is a part of the word to which various affixes can be attached.<br>
**Word lemmas:** A lemma is the canonical or base form for a set of words and is also known as the head word.<br>
**Dependency grammar:** This includes finding out the various relationships among the components in sentences and annotating the dependencies.<br>
**Constituency grammar:** This is used to add syntactic annotation to sentences based on their constituents including phrases and clauses.<br>
**Semantic types and roles:** The various constituents of sentences including words and phrases are annotated with specific semantic types and roles, often obtained from an ontology, which indicates what they do. These include things like place, person, time, organization, agent, recipient, theme, and so forth.

### Preprocessing is applied to the "tweet" column, as that is the text we want to process

### In our instance, we would include the following steps for our preprocessing:
Lowercasing <br>
Removing emojis <br>
Tokenizing, removing links, etc. <br>
Removing stopwords <br>
Normalizing words via lemmatizing <br>

### We select the "tweet" column to make preprocessing easier

In [81]:
newdf = df['tweet']

newdf

0     Giga Texas: 6 months ago vs this week  https://t.co/CAgmIJ3wKQ                                                                                                    
1     Music video filmed using Tesla Sentry Mode  https://t.co/g7ZYub2NH5                                                                                               
2     Solar Roof is designed to withstand heavy storms, hail and even medium-size tree branches  https://t.co/IvTROrPC0j                                                
3     You can stream Netflix &amp; YouTube, play video games or sing Caraoke in your Tesla 📺🎮🎤                                                                          
4     To activate, tap Software → long press model name → type holidays or ModelXMas                                                                                    
                                           ...                                                                                                             

In [82]:
# None (rather than the deprecated -1) disables column-width truncation
pd.set_option("display.max_colwidth", None)


### Don't run this function: it is a more complex function for URLs that a simple `re` pattern can't handle

In [306]:
def process_URLs(transient_tweet_text):
	'''
	replace all URLs in the tweet text
	'''
	UrlStart1 = regex_or('https?://', r'www\.', r'bit\.ly/')
	CommonTLDs = regex_or('com', r'co\.uk', 'org', 'net', 'info', 'ca', 'biz', 'edu', 'in', 'au')
	UrlStart2 = r'[a-z0-9\.-]+?' + r'\.' + CommonTLDs + pos_lookahead(r'[/ \W\b]')
	UrlBody = r'[^ \t\r\n<>]*?'  # * not + for case of:  "go to bla.com." -- don't want period
	UrlExtraCrapBeforeEnd = '%s+?' % regex_or(PunctChars, Entity)
	UrlEnd = regex_or(r'\.\.+', r'[<>]', r'\s', '$')
	Url = (optional(r'\b') +
		regex_or(UrlStart1, UrlStart2) +
		UrlBody +
		pos_lookahead(optional(UrlExtraCrapBeforeEnd) + UrlEnd))

	Url_RE = re.compile("(%s)" % Url, re.U|re.I)
	transient_tweet_text = re.sub(Url_RE, " constanturl ", transient_tweet_text)

	# Fix to handle unicodes in URL.

	URL_regex2 = r'\b(htt)[p\:\/]*([\\x\\u][a-z0-9]*)*'
	transient_tweet_text = re.sub(URL_regex2, " constanturl ", transient_tweet_text)
	return transient_tweet_text
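
`process_URLs` calls `regex_or`, `pos_lookahead`, `optional`, `PunctChars` and `Entity`, none of which are defined in this notebook. The sketch below shows one plausible set of definitions. They are assumptions made here for illustration, not the original helpers, and the character classes in particular may need adjusting.

```python
import re

def regex_or(*items):
    """Join alternatives into one non-capturing group: (?:a|b|c)."""
    return "(?:" + "|".join(items) + ")"

def pos_lookahead(pattern):
    """Wrap a pattern in a positive lookahead: (?=pattern)."""
    return "(?=" + pattern + ")"

def optional(pattern):
    """Make a pattern optional: (?:pattern)?"""
    return "(?:" + pattern + ")?"

# Assumed character classes; adjust to the data at hand.
PunctChars = r"""['".?!,:;]"""
Entity = r"&(?:amp|lt|gt|quot);"

# Small demonstration: replace URLs that start with http(s):// or www.
url_start = regex_or(r"https?://", r"www\.")
url_re = re.compile(url_start + r"\S+", re.I)
demo = url_re.sub(" constanturl ", "Giga Texas https://t.co/CAgmIJ3wKQ")
print(demo)
```

Building the pattern from small named pieces keeps each part readable and testable on its own, which is the main reason to prefer this style over a single long regular expression.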

In [83]:
result = re.sub(r"http\S+", "", str(newdf))

result

"0     Giga Texas: 6 months ago vs this week                                                                                                      \n1     Music video filmed using Tesla Sentry Mode                                                                                                 \n2     Solar Roof is designed to withstand heavy storms, hail and even medium-size tree branches                                                  \n3     You can stream\xa0Netflix &amp; YouTube, play video games or sing Caraoke\xa0in your Tesla 📺🎮🎤                                                                          \n4     To activate, tap Software → long press model name → type holidays or ModelXMas                                                                                    \n                                           ...                                                                                                                          \n95    @HarryStoltz1 F                     
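
Note that `str(newdf)` converts the *printed* form of the Series, so the result carries the index numbers, the `...` truncation row and the `Name: tweet, Length: 100, dtype: object` footer, and only the displayed rows are cleaned. A minimal sketch of the alternative, cleaning each tweet separately, is shown below on a plain Python list. The `strip_urls` name is hypothetical; with pandas the same function could be applied to the column with `df['tweet'].apply(strip_urls)`.

```python
import re

URL_RE = re.compile(r"http\S+")

def strip_urls(tweet):
    """Remove http/https links from a single tweet and trim whitespace."""
    return URL_RE.sub("", tweet).strip()

tweets = [
    "Giga Texas: 6 months ago vs this week https://t.co/CAgmIJ3wKQ",
    "Music video filmed using Tesla Sentry Mode https://t.co/g7ZYub2NH5",
]
cleaned = [strip_urls(t) for t in tweets]
print(cleaned)
```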

In [84]:
# Collapse runs of a repeated character to two characters (crude handling of elongated words)


def prune_multple_consecutive_same_char(transient_tweet_text):
	'''
	yesssssssss  is converted to yess 
	ssssssssssh is converted to ssh
	'''
	transient_tweet_text = re.sub(r'(.)\1+', r'\1\1', transient_tweet_text)
	return transient_tweet_text

In [85]:
prune_multple_consecutive_same_char(result)

"0  Giga Texas: 6 months ago vs this week  \n1  Music video filmed using Tesla Sentry Mode  \n2  Solar Roof is designed to withstand heavy storms, hail and even medium-size tree branches  \n3  You can stream\xa0Netflix &amp; YouTube, play video games or sing Caraoke\xa0in your Tesla 📺🎮🎤  \n4  To activate, tap Software → long press model name → type holidays or ModelXMas  \n  ..  \n95  @HarryStoltz1 F  \n96  Do you want a sneak peek of our crash lab?  Of course you do.  It’s one of the reasons why Model 3 is among the safest cars on the road.  \n97  @Super_modi_ we're good  \n98  @Super_modi_ CAR  \n99  @Super_modi_ CAR-aoke guys  \nName: tweet, Length: 100, dtype: object"

In [86]:
result = result.split("\n")
result

['0     Giga Texas: 6 months ago vs this week                                                                                                      ',
 '1     Music video filmed using Tesla Sentry Mode                                                                                                 ',
 '2     Solar Roof is designed to withstand heavy storms, hail and even medium-size tree branches                                                  ',
 '3     You can stream\xa0Netflix &amp; YouTube, play video games or sing Caraoke\xa0in your Tesla 📺🎮🎤                                                                          ',
 '4     To activate, tap Software → long press model name → type holidays or ModelXMas                                                                                    ',
 '                                           ...                                                                                                                          ',
 '95    @HarryStoltz1 F  

In [37]:
# Replace emojis with their text descriptions

def remove_emoji(transient_tweet_text):
    '''
    Replace each emoji with its textual description (e.g. 📺 -> television)
    '''
    tweet_tokenizer = TweetTokenizer()
    tokenized_tweet = tweet_tokenizer.tokenize(transient_tweet_text)
    # demoji.findall returns a dict mapping each emoji to its description
    emojis_present = demoji.findall(transient_tweet_text)
    tweet_no_emoji = ''
    for token in tokenized_tweet:
        # swap emoji tokens for their description, keep other tokens as they are
        tweet_no_emoji = tweet_no_emoji + ' ' + emojis_present.get(token, token)
    return tweet_no_emoji

In [38]:
remove_emoji(str(result))

' [ \' 0 Giga Texas : 6 months ago vs this week \' , \' 1 Music video filmed using Tesla Sentry Mode \' , \' 2 Solar Roof is designed to withstand heavy storms , hail and even medium-size tree branches \' , \' 3 You can stream \\ xa0Netflix & YouTube , play video games or sing Caraoke \\ xa0in your Tesla television video game microphone \' , \' 4 To activate , tap Software → long press model name → type holidays or ModelXMas \' , \' ... \' , \' 95 @HarryStoltz1 F \' , \' 96 Do you want a sneak peek of our crash lab ? Of course you do . It ’ s one of the reasons why Model 3 is among the safest cars on the road . \' , " 97 @Super_modi_ we\'re good " , \' 98 @Super_modi_ CAR \' , \' 99 @Super_modi_ CAR-aoke guys \' , \' Name : tweet , Length : 100 , dtype : object \' ]'

In [39]:
from pandas import DataFrame

In [40]:
result = DataFrame(result, columns=['tweets'])

result

Unnamed: 0,tweets
0,0 Giga Texas: 6 months ago vs this week
1,1 Music video filmed using Tesla Sentry Mode
2,"2 Solar Roof is designed to withstand heavy storms, hail and even medium-size tree branches"
3,"3 You can stream Netflix &amp; YouTube, play video games or sing Caraoke in your Tesla 📺🎮🎤"
4,"4 To activate, tap Software → long press model name → type holidays or ModelXMas"
5,...
6,95 @HarryStoltz1 F
7,96 Do you want a sneak peek of our crash lab? Of course you do. It’s one of the reasons why Model 3 is among the safest cars on the road.
8,97 @Super_modi_ we're good
9,98 @Super_modi_ CAR


In [41]:
new_result = result['tweets'].to_string() 



In [42]:
type(new_result)

str

In [87]:
def to_LowerCase(transient_tweet_text):
    '''
    Convert tweet text to lower case
    '''
    transient_tweet_text = transient_tweet_text.lower()
    return transient_tweet_text

In [88]:
new_result = to_LowerCase(new_result)

new_result

" 0 0 giga texas : 6 months ago vs this week 1 1 music video filmed using tesla sentry mode 2 2 solar roof is designed to withstand heavy storms , hail and even medium-size tree branches 3 3 you can stream netflix & youtube , play video games or sing caraoke in your tesla television video game microphone 4 4 to activate , tap software → long press model name → type holidays or modelxmas 5 ... 6 95 @harrystoltz1 f 7 96 do you want a sneak peek of our crash lab ? of course you do . it ’ s one of the reasons why model 3 is among the safest cars on the road . 8 97 @super_modi_ we're good 9 98 @super_modi_ car 10 99 @super_modi_ car-aoke guys 11 name : tweet , length : 100 , dtype : object"


In [90]:
new_result = remove_emoji(new_result)

new_result

" 0 0 giga texas : 6 months ago vs this week 1 1 music video filmed using tesla sentry mode 2 2 solar roof is designed to withstand heavy storms , hail and even medium-size tree branches 3 3 you can stream netflix & youtube , play video games or sing caraoke in your tesla television video game microphone 4 4 to activate , tap software → long press model name → type holidays or modelxmas 5 ... 6 95 @harrystoltz1 f 7 96 do you want a sneak peek of our crash lab ? of course you do . it ’ s one of the reasons why model 3 is among the safest cars on the road . 8 97 @super_modi_ we're good 9 98 @super_modi_ car 10 99 @super_modi_ car-aoke guys 11 name : tweet , length : 100 , dtype : object"

In [91]:
def strip_unicode(transient_tweet_text):
    '''
    Strip all non-ASCII characters from a tweet
    '''
    tweet = ''.join(i for i in transient_tweet_text if ord(i)<128)
    return tweet 

In [92]:
clean_tweets = strip_unicode(new_result)

In [93]:
clean_tweets


" 0 0 giga texas : 6 months ago vs this week 1 1 music video filmed using tesla sentry mode 2 2 solar roof is designed to withstand heavy storms , hail and even medium-size tree branches 3 3 you can stream netflix & youtube , play video games or sing caraoke in your tesla television video game microphone 4 4 to activate , tap software  long press model name  type holidays or modelxmas 5 ... 6 95 @harrystoltz1 f 7 96 do you want a sneak peek of our crash lab ? of course you do . it  s one of the reasons why model 3 is among the safest cars on the road . 8 97 @super_modi_ we're good 9 98 @super_modi_ car 10 99 @super_modi_ car-aoke guys 11 name : tweet , length : 100 , dtype : object"

In [94]:

def process_Mentions(transient_tweet_text):
    '''
    Replace every @mention with a single space
    '''
    transient_tweet_text = re.sub(r"@(\w+)"," ", transient_tweet_text)
    return transient_tweet_text

In [95]:
process_Mentions(clean_tweets)

" 0 0 giga texas : 6 months ago vs this week 1 1 music video filmed using tesla sentry mode 2 2 solar roof is designed to withstand heavy storms , hail and even medium-size tree branches 3 3 you can stream netflix & youtube , play video games or sing caraoke in your tesla television video game microphone 4 4 to activate , tap software  long press model name  type holidays or modelxmas 5 ... 6 95   f 7 96 do you want a sneak peek of our crash lab ? of course you do . it  s one of the reasons why model 3 is among the safest cars on the road . 8 97   we're good 9 98   car 10 99   car-aoke guys 11 name : tweet , length : 100 , dtype : object"

In [99]:
def process_HashTags(transient_tweet_text):
    '''
	Strip all Hashtags from a tweet
	'''
    transient_tweet_text = re.sub(r"#(\w+)\b", '', transient_tweet_text)
    return transient_tweet_text

In [100]:
clean_tweet = process_HashTags(clean_tweets)

clean_tweet

" 0 0 giga texas : 6 months ago vs this week 1 1 music video filmed using tesla sentry mode 2 2 solar roof is designed to withstand heavy storms , hail and even medium-size tree branches 3 3 you can stream netflix & youtube , play video games or sing caraoke in your tesla television video game microphone 4 4 to activate , tap software  long press model name  type holidays or modelxmas 5 ... 6 95 @harrystoltz1 f 7 96 do you want a sneak peek of our crash lab ? of course you do . it  s one of the reasons why model 3 is among the safest cars on the road . 8 97 @super_modi_ we're good 9 98 @super_modi_ car 10 99 @super_modi_ car-aoke guys 11 name : tweet , length : 100 , dtype : object"

In [101]:
type(clean_tweet)

str

### Write a function to clean and automate the different parts


Source for some parts: https://catriscode.com/2021/05/01/tweets-cleaning-with-python/

In [119]:
clean_tweet = re.sub(r"#\S+", " ", clean_tweet)


clean_tweet

" 0 0 giga texas : 6 months ago vs this week 1 1 music video filmed using tesla sentry mode 2 2 solar roof is designed to withstand heavy storms , hail and even medium-size tree branches 3 3 you can stream netflix & youtube , play video games or sing caraoke in your tesla television video game microphone 4 4 to activate , tap software  long press model name  type holidays or modelxmas 5 ... 6 95 @harrystoltz1 f 7 96 do you want a sneak peek of our crash lab ? of course you do . it  s one of the reasons why model 3 is among the safest cars on the road . 8 97 @super_modi_ we're good 9 98 @super_modi_ car 10 99 @super_modi_ car-aoke guys 11 name : tweet , length : 100 , dtype : object"

In [318]:
def clean_text(text):
    pass
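
The `clean_text` stub above is left empty. One way the individual steps in this notebook could be combined into a single per-tweet function is sketched below. The name `clean_tweet_text` and the tiny `STOPWORDS` set are hypothetical; in practice the NLTK stop word list would take its place.

```python
import re

# Hypothetical, deliberately small stop word set for illustration.
STOPWORDS = {"is", "to", "and", "of", "the", "in", "or", "you", "can", "your"}

def clean_tweet_text(tweet, stopwords=STOPWORDS):
    """Apply the cleaning steps to one tweet and return the cleaned string."""
    text = tweet.lower()
    text = re.sub(r"http\S+", " ", text)          # links
    text = re.sub(r"[@#]\w+", " ", text)          # mentions and hashtags
    text = re.sub(r"&\w+;", " ", text)            # HTML entities such as &amp;
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)    # yesss -> yess
    text = re.sub(r"[^a-z\s]", " ", text)         # keep letters only
    words = [w for w in text.split() if w not in stopwords]
    return " ".join(words)

print(clean_tweet_text("@Super_modi_ we're sooooo good https://t.co/abc #Tesla"))
# we re soo good
```

Working on one tweet at a time keeps row boundaries intact, so the cleaned text can be written back to the DataFrame as a new column.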

In [103]:
mentions = re.findall("@([a-zA-Z0-9_]{1,50})", clean_tweet)

print(mentions)

['harrystoltz1', 'super_modi_', 'super_modi_', 'super_modi_']


### Finding & Removing hashtags

In [104]:

# no hashtags in our df, but this works when hashtags are present

hashtags = re.findall("#([a-zA-Z0-9_]{1,50})", clean_tweet)
print(hashtags)

# removing hashtags
clean_tweet = re.sub("#[A-Za-z0-9_]+","", clean_tweet)

[]


### Removing mentions

In [160]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

clean_tweets = re.sub("@[A-Za-z0-9_]+","", clean_tweet)

clean_tweets

" 0 0 giga texas : 6 months ago vs this week 1 1 music video filmed using tesla sentry mode 2 2 solar roof is designed to withstand heavy storms , hail and even medium-size tree branches 3 3 you can stream netflix & youtube , play video games or sing caraoke in your tesla television video game microphone 4 4 to activate , tap software  long press model name  type holidays or modelxmas 5 ... 6 95  f 7 96 do you want a sneak peek of our crash lab ? of course you do . it  s one of the reasons why model 3 is among the safest cars on the road . 8 97  we're good 9 98  car 10 99  car-aoke guys 11 name : tweet , length : 100 , dtype : object"

In [159]:
#Removing all the Non-Alphanumeric characters

clean_tweets = re.sub("[^a-z0-9]"," ", clean_tweets)

clean_tweets

' 0 0 giga texas   6 months ago vs this week 1 1 music video filmed using tesla sentry mode 2 2 solar roof is designed to withstand heavy storms   hail and even medium size tree branches 3 3 you can stream netflix   youtube   play video games or sing caraoke in your tesla television video game microphone 4 4 to activate   tap software  long press model name  type holidays or modelxmas 5     6 95  f 7 96 do you want a sneak peek of our crash lab   of course you do   it  s one of the reasons why model 3 is among the safest cars on the road   8 97  we re good 9 98  car 10 99  car aoke guys 11 name   tweet   length   100   dtype   object'

In [161]:
# Remove numbers and punctuation, convert to lowercase, and collapse whitespace

def clean_text(text):
    # remove numbers
    text_nonum = re.sub(r'\d+', '', text)
    # remove punctuation and convert characters to lower case
    text_nopunct = "".join([char.lower() for char in text_nonum if char not in string.punctuation])
    # collapse runs of whitespace into a single space and strip leading/trailing whitespace
    text_no_doublespace = re.sub(r'\s+', ' ', text_nopunct).strip()
    return text_no_doublespace

clean_tweets = clean_text(clean_tweets)


clean_tweets

'giga texas months ago vs this week music video filmed using tesla sentry mode solar roof is designed to withstand heavy storms hail and even mediumsize tree branches you can stream netflix youtube play video games or sing caraoke in your tesla television video game microphone to activate tap software long press model name type holidays or modelxmas f do you want a sneak peek of our crash lab of course you do it s one of the reasons why model is among the safest cars on the road were good car caraoke guys name tweet length dtype object'

In [112]:

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dark_prince/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Remove stopwords

Source: https://stackabuse.com/removing-stop-words-from-strings-in-python

In [162]:


all_stopwords = stopwords.words('english')
all_stopwords.remove('not')
clean_tweets = ' '.join([word for word in clean_tweets.split(' ') if word not in all_stopwords])
print(clean_tweets)

giga texas months ago vs week music video filmed using tesla sentry mode solar roof designed withstand heavy storms hail even mediumsize tree branches stream netflix youtube play video games sing caraoke tesla television video game microphone activate tap software long press model name type holidays modelxmas f want sneak peek crash lab course one reasons model among safest cars road good car caraoke guys name tweet length dtype object


### Stemming

A word stem is the base form of a word, to which affixes can be attached to create new words; this process is known as inflection. The reverse, obtaining the base form of a word from its inflected form, is known as stemming. Consider the word “JUMP”: you can add affixes to it to form new words such as “JUMPS”, “JUMPED”, and “JUMPING”. In this case, the base word “JUMP” is the word stem.


![title](IMG/picture.png)
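
To make the idea concrete, the sketch below strips a few common suffixes. It is a deliberately naive illustration, not the Porter algorithm used in the next cell; the `naive_stem` name and the suffix list are assumptions made for this example.

```python
# Assumed, very short suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["jump", "jumps", "jumped", "jumping"]])
# ['jump', 'jump', 'jump', 'jump']
```

A rule set this small breaks quickly on real text, which is why an established stemmer is used for the actual corpus.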

In [163]:
def simple_stemmer(text):
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

clean_tweets = simple_stemmer(clean_tweets)

clean_tweets

'giga texa month ago vs week music video film use tesla sentri mode solar roof design withstand heavi storm hail even mediums tree branch stream netflix youtub play video game sing caraok tesla televis video game microphon activ tap softwar long press model name type holiday modelxma f want sneak peek crash lab cours one reason model among safest car road good car caraok guy name tweet length dtype object'

### Lemmatization
The process of lemmatization is very similar to stemming: we remove word affixes to get to a base form of the word. In this case, however, the base form is the root word (the lemma), which is always a valid dictionary word, rather than the root stem, which need not be.

https://towardsdatascience.com/a-beginners-guide-to-stemming-in-natural-language-processing-34ddee4acd37

In [141]:
import nltk
nltk.download('wordnet')




[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/dark_prince/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [116]:
# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()
# Iterate over words (clean_tweets.split()), not over the characters of the string
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in clean_tweets.split()])


lemmatized_output

### Tokenise and remove non-essential tokens (e.g. punctuation)

**Tokenization** can be defined as the process of breaking down or splitting textual data into smaller and more meaningful components called tokens. In the following sections, we look at some ways to tokenize text into sentences and words.

**Word tokenization** is the process of splitting or segmenting sentences into their constituent words. A sentence is a collection of words, and with tokenization we essentially split a sentence into a list of words that can be used to reconstruct the sentence. Word tokenization is important in many processes, especially in cleaning and normalizing text, where operations like stemming and lemmatization work on each individual word. NLTK provides various useful interfaces for word tokenization. The main ones are:
• word_tokenize <br>
• TreebankWordTokenizer<br>
• TokTokTokenizer<br>
• RegexpTokenizer<br>
• Inherited tokenizers from RegexpTokenizer<br>

We are using NLTK's **word_tokenize** for our purpose.
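
As a rough picture of what a word tokenizer does, the sketch below uses a regular expression from the standard library. It illustrates the pattern-based idea behind tokenizers such as `RegexpTokenizer` and is not what `word_tokenize` does internally; the `regex_tokenize` name is hypothetical.

```python
import re

# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def regex_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(regex_tokenize("Solar Roof withstands hail, even branches."))
# ['Solar', 'Roof', 'withstands', 'hail', ',', 'even', 'branches', '.']
```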


In [164]:
# we are using nltk word_tokenize to tokenize the corpus 
default_wt = nltk.word_tokenize

tweets_wt = default_wt(clean_tweets)

np.array(tweets_wt)

array(['giga', 'texa', 'month', 'ago', 'vs', 'week', 'music', 'video',
       'film', 'use', 'tesla', 'sentri', 'mode', 'solar', 'roof',
       'design', 'withstand', 'heavi', 'storm', 'hail', 'even', 'mediums',
       'tree', 'branch', 'stream', 'netflix', 'youtub', 'play', 'video',
       'game', 'sing', 'caraok', 'tesla', 'televis', 'video', 'game',
       'microphon', 'activ', 'tap', 'softwar', 'long', 'press', 'model',
       'name', 'type', 'holiday', 'modelxma', 'f', 'want', 'sneak',
       'peek', 'crash', 'lab', 'cours', 'one', 'reason', 'model', 'among',
       'safest', 'car', 'road', 'good', 'car', 'caraok', 'guy', 'name',
       'tweet', 'length', 'dtype', 'object'], dtype='<U9')

In [24]:
import matplotlib.pyplot as plt
pd.options.display.max_colwidth = 200
%matplotlib inline

In [1]:
import pandas as pd


In [220]:
def deEmojify(string_uncleaned):
    # Character class covering emoji and related symbol ranges (duplicate ranges removed)
    regrex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing through miscellaneous symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad range: also matches CJK and other non-emoji text
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
                           "]+", flags = re.UNICODE)
    return regrex_pattern.sub(r'',string_uncleaned)





In [224]:
deEmojify(new_result)

'0      0     Giga Texas: 6 months ago vs this week                                                                                                                                                                                                                            \n1      1     Music video filmed using Tesla Sentry Mode                                                                                                                                                                                                                       \n2      2     Solar Roof is designed to withstand heavy storms, hail and even medium-size tree branches                                                                                                                                                                        \n3      3     You can stream\xa0Netflix &amp; YouTube, play video games or sing Caraoke\xa0in your Tesla                                                                                


In [87]:
# Alternative, stricter URL pattern (scheme://host/path)
URLless_string = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', result)


result

"0     0     Giga Texas: 6 months ago vs this week                                                                                                                             \n1     1     Music video filmed using Tesla Sentry Mode                                                                                                                        \n2     2     Solar Roof is designed to withstand heavy storms, hail and even medium-size tree branches                                                                         \n3     3     You can stream\xa0Netflix &amp; YouTube, play video games or sing Caraoke\xa0in your Tesla 📺🎮🎤                                                                          \n4     4     To activate, tap Software → long press model name → type holidays or ModelXMas                                                                                    \n5                                                ...                                                             

### Lowercasing using lambda function 

In [92]:
newdf = df['tweet'].to_list()

newdf

['gigatexas:6monthsagovsthisweekhttps://t.co/cagmij3wkq',
 'musicvideofilmedusingteslasentrymodehttps://t.co/g7zyub2nh5',
 'solarroofisdesignedtowithstandheavystorms,hailandevenmedium-sizetreebrancheshttps://t.co/ivtrorpc0j',
 'youcanstreamnetflix&amp;youtube,playvideogamesorsingcaraokeinyourtesla📺🎮🎤',
 'toactivate,tapsoftware→longpressmodelname→typeholidaysormodelxmas',
 'allmodelxcandance📸@thechristina99@michellehellmanhttps://t.co/qjh8qljzwk',
 'production&amp;deliveriesinq2surpassed200,000vehicleshttps://t.co/xyloa0jhx7',
 'experiencingplaidhttps://t.co/nipaxkbkjy',
 'https://t.co/v9mfaesy1v',
 'https://t.co/k1h6oqse47',
 'playingcyberpunkinmodelsplaidhttps://t.co/y9ev4f6eau',
 'presentationstartsat8:15pmpacifichttps://t.co/tgzjumralohttps://t.co/xw1s6atqus',
 'modelsplaiddeliveryeventatourfremontfactorywillbestreamedliveonjune10,7pmpacifichttps://t.co/v7c77ysfti',
 'justsurpassed200kpowerwallinstallsglobally🔋🏡☀️',
 'ifyouwanttohelpcovergigaberlininawesomegraffitiart,sendusyourwork