<a href="https://colab.research.google.com/github/IanPerigoVianna/George_Carlin_NLP/blob/main/NLP_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Project for studying NLP (Natural Language Processing)

I will scrape the transcripts of performances by the comedian George Carlin covering the period from 1978 to 2024.

Objective

Compare which words the comedian uses most and whether the frequency of particular words changed over the decades.

In [22]:
import requests
from bs4 import BeautifulSoup
import pickle
import pandas as pd


In [2]:
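def url_to_transcript(url):
  # Download the page and collect the text of every <p> inside the main container
  response = requests.get(url, timeout=30)
  response.raise_for_status()  # stop early on HTTP errors instead of parsing an error page
  soup = BeautifulSoup(response.text, "lxml")
  text = [p.text for p in soup.find(class_="ast-container").find_all('p')]
  print(url)

  return text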

urls = ['https://scrapsfromtheloft.com/comedy/george-carlin-1978-full-transcript/',
        'https://scrapsfromtheloft.com/comedy/carlin-at-carnegie-transcript/',
        'https://scrapsfromtheloft.com/comedy/george-carlin-doin-it-again-transcript/',
        'https://scrapsfromtheloft.com/comedy/george-carlin-dumb-americans-transcript/',
        'https://scrapsfromtheloft.com/comedy/george-carlin-politically-correct-language/']

# Year of each transcript, in the same order as urls

anos = ['1978','1983','1990','2006','2024']


In [3]:
transcripts = [url_to_transcript(u) for u in urls]

https://scrapsfromtheloft.com/comedy/george-carlin-1978-full-transcript/
https://scrapsfromtheloft.com/comedy/carlin-at-carnegie-transcript/
https://scrapsfromtheloft.com/comedy/george-carlin-doin-it-again-transcript/
https://scrapsfromtheloft.com/comedy/george-carlin-dumb-americans-transcript/
https://scrapsfromtheloft.com/comedy/george-carlin-politically-correct-language/


In [4]:
# Save each transcript as a pickle file for later use

for i, a in enumerate(anos):
  with open('carlin' + a + ".txt", 'wb') as file:
    pickle.dump(transcripts[i], file)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [5]:
# Load the pickle files

data = {}

for a in anos:
  with open('carlin' + a + ".txt", 'rb') as file:
    data[a] = pickle.load(file)
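
The save/load round trip can be checked in isolation before relying on it. The following is only a minimal sketch: the paragraphs are made up, the files go to a temporary folder, and both the helper name `save_and_load` and the `.pkl` extension (a common convention for pickle files) are hypothetical choices that are not part of this notebook.

```python
import pickle
import tempfile
from pathlib import Path

# Made-up stand-ins for the scraped transcripts (illustrative only)
sample_transcripts = {
    "1978": ["First paragraph.", "Second paragraph."],
    "1983": ["Another show.", "More text."],
}

def save_and_load(transcripts, folder):
    """Pickle each list of paragraphs to its own file, then read them all back."""
    folder = Path(folder)
    for year, paragraphs in transcripts.items():
        with open(folder / f"carlin{year}.pkl", "wb") as file:
            pickle.dump(paragraphs, file)

    loaded = {}
    for year in transcripts:
        with open(folder / f"carlin{year}.pkl", "rb") as file:
            loaded[year] = pickle.load(file)
    return loaded

with tempfile.TemporaryDirectory() as tmp:
    restored = save_and_load(sample_transcripts, tmp)

# The restored dictionary should be identical to what was saved
print(restored == sample_transcripts)
```

Pickle preserves the Python list structure, so each value comes back as a list of paragraphs rather than as one block of text.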

In [None]:
# Check the contents loaded from the 1978 file
data['1978']

Let's start with the text-cleaning techniques (pre-processing)

- Convert the entire text to lowercase
- Remove punctuation
- Remove numeric values
- Remove extra whitespace
- Tokenize the text
- Remove stop words

After tokenization
- Lemmatization
- Part-of-speech tagging
- Bigrams and trigrams
- Typo correction
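
The first group of steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the cleaning code used later in this notebook: the function name `preprocess`, the sample sentence, and the tiny `STOP_WORDS` set are all hypothetical, and the post-tokenization steps (lemmatization, part-of-speech tagging, n-grams) are not covered.

```python
import re
import string

# Tiny illustrative stop-word list (hypothetical; a real one is far longer)
STOP_WORDS = {"the", "a", "an", "and", "is", "in", "of", "to", "it"}

def preprocess(text):
    """Lowercase, strip punctuation and digits, tokenize, drop stop words."""
    text = text.lower()
    # Remove ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove any word that contains a digit
    text = re.sub(r"\w*\d\w*", "", text)
    # split() with no arguments collapses runs of whitespace and tokenizes
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("It is 1978, and the show   is in Phoenix!")
print(tokens)  # ['show', 'phoenix']
```

Splitting on whitespace is the simplest possible tokenizer; it is enough here because punctuation has already been removed.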


In [10]:
# List the dictionary keys (the transcript years)
list(iter(data))

['1978', '1983', '1990', '2006', '2024']

In [None]:
# Inspection
# Our dictionary looks like this: key = year of the performance, value = list of text paragraphs
next(iter(data.values()))

In [15]:
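# Join a list of text paragraphs into a single string
# Note: ''.join adds no separator, so the last word of one paragraph fuses with
# the first word of the next ("phoenixperformed" in the cleaned output further down)
def combine_txt(list_txt):
  combined_txt = ''.join(list_txt)
  return combined_txt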


In [17]:
data_combined = {key:[combine_txt(value)] for (key,value) in data.items()}
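
The choice of separator matters for the word counts that come later. The sketch below uses two made-up paragraphs to compare joining with an empty string, as `combine_txt` does, against joining with a single space. It is a suggestion, not a change that has been applied to the outputs recorded in this notebook.

```python
# Made-up paragraphs standing in for one scraped transcript
paragraphs = ["recorded live in phoenix", "performed in july"]

fused = "".join(paragraphs)    # no separator: "phoenix" and "performed" run together
spaced = " ".join(paragraphs)  # one space between paragraphs

print(fused)   # recorded live in phoenixperformed in july
print(spaced)  # recorded live in phoenix performed in july
```

With the empty separator, "phoenixperformed" would later be counted as a single word, which distorts frequencies at every paragraph boundary.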

In [None]:
pd.set_option('display.max_colwidth', 150)

# Transpose so the years become the index and the single column
# holds the combined text from the dictionary
data = pd.DataFrame.from_dict(data_combined).transpose()
data.columns = ['transcript']
data = data.sort_index()

data

In [None]:
data['transcript']['1978']

In [45]:
import re
import string

# First cleaning pass:
# - convert to lowercase
# - remove text in square brackets
# - remove punctuation
# - remove words containing digits
def clean_1(text):
  text = text.lower()
  # Escape the brackets: an unescaped '[.*?]' is a character class that only strips '.', '*' and '?'
  text = re.sub(r'\[.*?\]', '', text)
  text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
  text = re.sub(r'\w*\d\w*', '', text)

  return text

first_clean = clean_1
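
A note on bracket patterns in regular expressions: an unescaped `[.*?]` is a character class that matches the single characters `.`, `*`, and `?`, whereas `\[.*?\]` matches a literal bracketed span. The sketch below compares the two on a made-up sentence; the variable names are illustrative only.

```python
import re

sample = "well [laughter] what? no."

# Unescaped brackets form a character class: only '.', '*' and '?' are removed
as_char_class = re.sub(r"[.*?]", "", sample)

# Escaped brackets match a literal '[' ... ']' span (non-greedy)
as_bracketed_span = re.sub(r"\[.*?\]", "", sample)

print(as_char_class)      # well [laughter] what no
print(as_bracketed_span)  # well  what? no.
```

The second result keeps two spaces where the bracketed text used to be, so a whitespace-normalization step is still useful afterwards.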

In [46]:
data_clean = pd.DataFrame(data.transcript.apply(first_clean))

In [47]:
data_clean

Unnamed: 0,transcript
1978,sometimes listed as on location george carlin at phoenixperformed at the celebrity star theater in phoenix on july this is george carlin and i ...
1983,recorded at carnegie hall new york city in released in ’s heard the old joke how do you get to carnegie hall practice man practice well like most...
1990,recorded on january – state theatre new brunswick new jerseyso you want to talk about it oh yeah it all started in i mean that’s when i started ...
2006,from life is worth losing\nrecorded on november beacon theater new york city new york“it’s called the american dream because you have to be asle...
2024,i know i’m a little late with this but i’d like to get a few licks on this totally bogus topic before it completely disappears from everyone’s con...


In [None]:
data_clean['transcript']['1978']

In [52]:
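def clean_2(text):
  # Remove curly quotes and the ellipsis character left over after the first pass
  text = re.sub('[‘’“”…]', '', text)
  # Replace newlines with a space so adjacent words are not fused together
  text = re.sub('\n', ' ', text)
  return text

second_clean = clean_2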

In [None]:
data_clean = pd.DataFrame(data_clean.transcript.apply(clean_2))

data_clean

Organizing the data
- corpus (collection of texts)
- document-term matrix (word counts in matrix form)
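
To make the document-term matrix concrete, here is a minimal sketch built with `collections.Counter` from the standard library. The two short texts are made up, and the names `corpus`, `counts`, `vocabulary`, and `dtm` are illustrative. A library vectorizer such as scikit-learn's `CountVectorizer` is commonly used for this step, but it is not what is shown here.

```python
from collections import Counter

# Made-up cleaned texts keyed by year (illustrative only)
corpus = {
    "1978": "words words seven words",
    "1990": "language words change",
}

# Count tokens per document
counts = {year: Counter(text.split()) for year, text in corpus.items()}

# A shared, sorted vocabulary gives the matrix its columns
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one column per term (a missing term counts as 0)
dtm = {year: [counts[year][term] for term in vocabulary] for year in corpus}

print(vocabulary)  # ['change', 'language', 'seven', 'words']
for year, row in dtm.items():
    print(year, row)
```

Each row is one transcript and each column is one term, so comparing a column across rows shows how the use of that word changes from year to year.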