<a href="https://colab.research.google.com/github/IanPerigoVianna/George_Carlin_NLP/blob/main/NLP_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

NLP study project (Natural Language Processing)

I will scrape transcripts of performances by comedian George Carlin, covering the period from 1978 to 2024.

Objective

Compare which words the comedian uses most and check whether the frequency of certain words changed over the decades.

In [1]:
import requests
from bs4 import BeautifulSoup
import pickle
import pandas as pd


In [6]:
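def url_to_transcript(url):
  """Download a transcript page and return its paragraphs as a list of strings."""
  page = requests.get(url).text
  soup = BeautifulSoup(page, "lxml")
  # The transcript paragraphs sit inside the element with class "ast-container"
  text = [p.text for p in soup.find(class_="ast-container").find_all('p')]
  print(url)

  return text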

urls = ['https://scrapsfromtheloft.com/comedy/george-carlin-1978-full-transcript/',
        'https://scrapsfromtheloft.com/comedy/carlin-at-carnegie-transcript/',
        'https://scrapsfromtheloft.com/comedy/george-carlin-doin-it-again-transcript/',
        'https://scrapsfromtheloft.com/comedy/george-carlin-dumb-americans-transcript/',
        'https://scrapsfromtheloft.com/comedy/george-carlin-politically-correct-language/']

# Years of the transcripts

anos = ['1978','1983','1990','2006','2024']


In [7]:
transcripts = [url_to_transcript(u) for u in urls]

https://scrapsfromtheloft.com/comedy/george-carlin-1978-full-transcript/
https://scrapsfromtheloft.com/comedy/carlin-at-carnegie-transcript/
https://scrapsfromtheloft.com/comedy/george-carlin-doin-it-again-transcript/
https://scrapsfromtheloft.com/comedy/george-carlin-dumb-americans-transcript/
https://scrapsfromtheloft.com/comedy/george-carlin-politically-correct-language/


In [8]:
# Generate pickle files for later use
'''
for i,a in enumerate(anos):
  with open ('carlin' + a + ".txt",'wb') as file:
    pickle.dump(transcripts[i], file)
    '''

In [10]:
# Load the pickle files

data = {}

for i, a in enumerate(anos):

  with open('carlin' + a + ".txt",'rb') as file:
    data[a] = pickle.load(file)



In [None]:
# Inspect the loaded file
data['1978']

Let's start with the text cleaning techniques (preprocessing)

- Convert the whole text to lowercase
- Remove punctuation
- Remove numeric values
- Remove whitespace
- Tokenize the text
- Remove stop words

After tokenization
- Lemmatization
- Part-of-speech tagging
- Bigrams and trigrams
- Typos
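
The steps in the first list can be sketched with the standard library alone. This is only a minimal illustration, not the pipeline used in the cells that follow: the function name `preprocess` and the tiny `STOP_WORDS` set are assumptions made for the example, and a real run would rely on a full stop word list such as the one shipped with scikit-learn.

```python
import re
import string

# Hypothetical, deliberately tiny stop word list (illustration only)
STOP_WORDS = {"the", "a", "is", "and", "in", "of"}

def preprocess(text):
    """Lowercase, strip punctuation and digits, tokenize, drop stop words."""
    text = text.lower()
    # Remove ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove any token that contains a digit
    text = re.sub(r"\w*\d\w*", "", text)
    # split() with no argument also collapses repeated whitespace
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The year is 1978, and the   show is in Phoenix!")
print(tokens)
```

Splitting on whitespace is the simplest possible tokenizer; lemmatization and part-of-speech tagging would need a dedicated NLP library and are left out of this sketch.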


In [12]:
# List the dictionary keys (the performance years)
list(iter(data))

['1978', '1983', '1990', '2006', '2024']

In [None]:
# Inspection
# Our dictionary looks like this: key = year of the performance, value = list of text strings
next(iter(data.values()))

In [14]:
# Function to turn a list of strings into a single string
# Note: joining with '' concatenates paragraphs with no separator between them
def combine_txt(list_txt):
  combined_txt = ''.join(list_txt)
  return combined_txt


In [15]:
data_combined = {key:[combine_txt(value)] for (key,value) in data.items()}

In [None]:
pd.set_option('max_colwidth', 150)

# Move the dictionary keys to the index; the 'transcript' column receives
# the combined text from the dictionary
data = pd.DataFrame.from_dict(data_combined).transpose()
data.columns = ['transcript']
data = data.sort_index()

data

In [None]:
data['transcript']['1978']

In [18]:
import re
import string

# First cleaning pass:
# - convert to lowercase
# - remove text in square brackets
# - remove punctuation
# - remove words containing digits
def clean_1(text):
  text = text.lower()
  text = re.sub(r'\[.*?\]', '', text)  # text in square brackets
  text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
  text = re.sub(r'\w*\d\w*', '', text)  # any token containing a digit

  return text

first_clean = lambda x: clean_1(x)

In [19]:
data_clean = pd.DataFrame(data.transcript.apply(first_clean))

In [21]:
data_clean['transcript']['1978']



In [22]:
def clean_2(text):
  # Second pass: drop curly quotes and leftover dots, then remove newlines
  text = re.sub('[‘’“”...]', '', text)
  text = re.sub('\n', '', text)
  return text

second_clean = lambda x: clean_2(x)

In [23]:
data_clean = pd.DataFrame(data_clean.transcript.apply(clean_2))

data_clean

Unnamed: 0,transcript
1978,sometimes listed as on location george carlin at phoenixperformed at the celebrity star theater in phoenix on july this is george carlin and i ...
1983,recorded at carnegie hall new york city in released in s heard the old joke how do you get to carnegie hall practice man practice well like most ...
1990,recorded on january – state theatre new brunswick new jerseyso you want to talk about it oh yeah it all started in i mean thats when i started d...
2006,from life is worth losingrecorded on november beacon theater new york city new yorkits called the american dream because you have to be asleep t...
2024,i know im a little late with this but id like to get a few licks on this totally bogus topic before it completely disappears from everyones consci...


In [24]:
data_clean['transcript']['1978']



Organizing the data
- corpus (a collection of texts)
- document-term matrix (word counts in matrix format)

In [27]:
data_clean.head()

Unnamed: 0,transcript
1978,sometimes listed as on location george carlin at phoenixperformed at the celebrity star theater in phoenix on july this is george carlin and i ...
1983,recorded at carnegie hall new york city in released in s heard the old joke how do you get to carnegie hall practice man practice well like most ...
1990,recorded on january – state theatre new brunswick new jerseyso you want to talk about it oh yeah it all started in i mean thats when i started d...
2006,from life is worth losingrecorded on november beacon theater new york city new yorkits called the american dream because you have to be asleep t...
2024,i know im a little late with this but id like to get a few licks on this totally bogus topic before it completely disappears from everyones consci...


In [None]:
data_clean.to_pickle('corpus.pkl')

In [29]:
data.to_pickle('dirty_corpus.pkl')

Document-Term Matrix

- We will use the CountVectorizer class from scikit-learn
- The text will be tokenized into words
- English stop words will be removed
- A matrix will be created with every unique word as a column
- Each row, one per performance year, holds the number of times
  the word in the corresponding column occurs.
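
To make the shape of that matrix concrete, here is a plain-Python sketch of the same idea. It is not how CountVectorizer is implemented: the `build_dtm` helper, the toy documents, and the two-word stop list are all assumptions made for illustration.

```python
from collections import Counter

def build_dtm(docs, stop_words=()):
    """Return (vocabulary, rows): one row of word counts per document."""
    stop = set(stop_words)
    # Tokenize on whitespace and drop stop words
    tokenized = {name: [w for w in text.split() if w not in stop]
                 for name, text in docs.items()}
    # Every unique word becomes a column, in alphabetical order
    vocabulary = sorted({w for words in tokenized.values() for w in words})
    rows = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        rows[name] = [counts[w] for w in vocabulary]
    return vocabulary, rows

# Toy documents keyed by year (hypothetical text, not the real transcripts)
docs = {
    "1978": "words are all we have words",
    "1990": "we have dogs and words",
}
vocabulary, rows = build_dtm(docs, stop_words=["we", "and"])
print(vocabulary)
for year, row in rows.items():
    print(year, row)
```

Each row has one entry per vocabulary word, so a word that never appears in a given year simply gets a zero, which is why the real matrix is mostly zeros.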



In [30]:
# Library

from sklearn.feature_extraction.text import CountVectorizer

In [72]:
cv = CountVectorizer(stop_words = 'english')

data_cv = cv.fit_transform(data_clean.transcript)

# Build the DTM

data_dtm = pd.DataFrame(data_cv.toarray(), columns = cv.get_feature_names_out())



In [73]:
data_dtm

Unnamed: 0,able,abled,abortion,absolute,absolutely,abstract,absurd,accept,accepted,accident,...,youve,youyour,yoyo,zanzibar,zeeb,zip,zipper,zone,zones,zoo
0,1,0,0,0,0,0,0,0,0,1,...,8,1,0,0,0,0,0,0,1,1
1,0,0,1,0,1,0,0,2,0,3,...,8,0,0,0,0,0,0,1,0,0
2,2,2,1,1,4,1,0,1,1,2,...,3,0,1,1,1,1,1,0,0,0
3,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [35]:
# Serialize the tokenized DataFrame with stop words removed
# Afterwards, set the index to the corresponding performances

data_dtm.to_pickle('tokenized_data.pkl')

In [38]:
data_dtm.index = ['1978','1983','1990','2006','2024']

In [40]:
data_dtm = data_dtm.transpose()

Now let's begin the exploratory data analysis

- Check the most frequent words
- Vocabulary size
- Profanity, slang
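
As a rough sketch of the last two items, vocabulary size and profanity counts can both be read off a matrix with words as rows and years as columns. The `toy_dtm` frame and the `profanity` list below are hypothetical stand-ins with made-up counts, not values taken from the transcripts.

```python
import pandas as pd

# Toy stand-in for a transposed DTM: words as rows, years as columns
toy_dtm = pd.DataFrame(
    {"1978": [3, 0, 2, 1], "1990": [1, 4, 0, 0]},
    index=["fuck", "dog", "words", "shit"],
)

# Vocabulary size: number of distinct words used at least once in each show
vocab_size = (toy_dtm > 0).sum()

# Hypothetical profanity list; only words present in the matrix are kept
profanity = ["fuck", "shit", "damn"]
present = toy_dtm.index.intersection(profanity)
profanity_counts = toy_dtm.loc[present].sum()

print(vocab_size)
print(profanity_counts)
```

Using `index.intersection` avoids a `KeyError` when a listed word never occurs in any transcript.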

In [41]:
data_dtm

Unnamed: 0,1978,1983,1990,2006,2024
able,1,0,2,0,0
abled,0,0,2,0,0
abortion,0,1,1,0,0
absolute,0,0,1,0,0
absolutely,0,1,4,0,0
...,...,...,...,...,...
zip,0,0,1,0,0
zipper,0,0,1,0,0
zone,0,1,0,0,0
zones,1,0,0,0,0


In [None]:
# Get the 30 most frequent words
top_dict = {}

for i in data_dtm.columns:
  top = data_dtm[i].sort_values(ascending = False).head(30)
  top_dict[i] = list(zip(top.index, top.values))

top_dict

In [48]:
# List the 15 most frequent words in each show

for year, top_15 in top_dict.items():
  print(year)
  print(', '.join([word for word, count in top_15[0:15]]))
  print('---')

1978
know, time, say, just, dont, thats, fuck, like, youre, word, words, people, man, think, little
---
1983
dont, know, like, oh, da, look, dog, fuck, just, little, boy, la, thats, say, theyre
---
1990
dont, say, like, know, people, got, little, im, thats, want, theyre, think, dog, things, lot
---
2006
people, big, fucking, got, want, dont, country, know, thats, shit, love, fat, malls, kids, just
---
2024
people, theyre, white, black, fat, color, dont, like, way, midgets, say, think, just, africa, citizen
---


Some words carry little meaning and appear in most of his shows, such as 'like' and 'thats'. They can be added to the stop word list so that they are removed.

In [49]:
# Let's list them

from collections import Counter

In [None]:
words = []

for year in data_dtm.columns:
  top = [word for (word, count) in top_dict[year]]
  for t in top:
    words.append(t)

words



In [None]:
Counter(words).most_common()

In [69]:
stop_words_plus = [word for word, count in Counter(words).most_common() if count > 3]

stop_words_plus

['say',
 'just',
 'dont',
 'thats',
 'know',
 'like',
 'people',
 'think',
 'little',
 'shit',
 'im',
 'got',
 'theyre']

In [56]:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

In [64]:
data_clean = pd.read_pickle('corpus.pkl')

In [71]:
data_clean

Unnamed: 0,transcript
1978,sometimes listed as on location george carlin at phoenixperformed at the celebrity star theater in phoenix on july this is george carlin and i ...
1983,recorded at carnegie hall new york city in released in s heard the old joke how do you get to carnegie hall practice man practice well like most ...
1990,recorded on january – state theatre new brunswick new jerseyso you want to talk about it oh yeah it all started in i mean thats when i started d...
2006,from life is worth losingrecorded on november beacon theater new york city new yorkits called the american dream because you have to be asleep t...
2024,i know im a little late with this but id like to get a few licks on this totally bogus topic before it completely disappears from everyones consci...


In [76]:
stop_words = list(text.ENGLISH_STOP_WORDS.union(stop_words_plus))



In [78]:
cv = CountVectorizer(stop_words=stop_words)
data_cv = cv.fit_transform(data_clean.transcript)

data_cv

<5x3704 sparse matrix of type '<class 'numpy.int64'>'
	with 5675 stored elements in Compressed Sparse Row format>

In [83]:
data_stop = pd.DataFrame(data_cv.toarray(), columns = cv.get_feature_names_out())
data_stop.index = data_clean.index

In [87]:
data_stop

Unnamed: 0,able,abled,abortion,absolute,absolutely,abstract,absurd,accept,accepted,accident,...,youve,youyour,yoyo,zanzibar,zeeb,zip,zipper,zone,zones,zoo
1978,1,0,0,0,0,0,0,0,0,1,...,8,1,0,0,0,0,0,0,1,1
1983,0,0,1,0,1,0,0,2,0,3,...,8,0,0,0,0,0,0,1,0,0
1990,2,2,1,1,4,1,0,1,1,2,...,3,0,1,1,1,1,1,0,0,0
2006,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
2024,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [88]:
# Serialize this DataFrame with the new transformations
data_stop.to_pickle('dtm.pkl')

# Serialize the fitted vectorizer, which carries the extended stop word list

with open("cv_stop.pkl", "wb") as file:
  pickle.dump(cv, file)

Now we can analyze the words

Let's make a word cloud

In [90]:
from wordcloud import WordCloud

In [91]:
wc = WordCloud(stopwords=stop_words, background_color="white", colormap="Dark2",
               max_font_size=150, random_state=42)