<a href="https://colab.research.google.com/github/IanPerigoVianna/George_Carlin_NLP/blob/main/NLP_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A study project on NLP (Natural Language Processing)

I will web-scrape transcripts of performances by the comedian George Carlin, covering the period from 1978 to 2024.

Goal

Compare which words the comedian uses most and whether the frequency of certain words changed over the decades.

In [22]:
import requests
from bs4 import BeautifulSoup
import pickle
import pandas as pd


In [2]:
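def url_to_transcript(url):
  '''Download a transcript page and return its paragraphs as a list of strings.'''
  page = requests.get(url).text
  soup = BeautifulSoup(page, "lxml")
  # The transcript text lives in <p> tags inside the "ast-container" element
  text = [p.text for p in soup.find(class_="ast-container").find_all('p')]
  # Print the URL as a simple progress indicator
  print(url)

  return text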

urls = ['https://scrapsfromtheloft.com/comedy/george-carlin-1978-full-transcript/',
        'https://scrapsfromtheloft.com/comedy/carlin-at-carnegie-transcript/',
        'https://scrapsfromtheloft.com/comedy/george-carlin-doin-it-again-transcript/',
        'https://scrapsfromtheloft.com/comedy/george-carlin-dumb-americans-transcript/',
        'https://scrapsfromtheloft.com/comedy/george-carlin-politically-correct-language/']

# Year associated with each transcript, in the same order as urls

anos = ['1978','1983','1990','2006','2024']


In [3]:
transcripts = [url_to_transcript(u) for u in urls]

https://scrapsfromtheloft.com/comedy/george-carlin-1978-full-transcript/
https://scrapsfromtheloft.com/comedy/carlin-at-carnegie-transcript/
https://scrapsfromtheloft.com/comedy/george-carlin-doin-it-again-transcript/
https://scrapsfromtheloft.com/comedy/george-carlin-dumb-americans-transcript/
https://scrapsfromtheloft.com/comedy/george-carlin-politically-correct-language/


In [4]:
# Save each transcript as a pickle file for later use
# (the files are binary pickles despite the .txt extension)

for i,a in enumerate(anos):
  with open ('carlin' + a + ".txt",'wb') as file:
    pickle.dump(transcripts[i], file)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [5]:
# Load the pickle files

data = {}

for i, a in enumerate(anos):

  with open('carlin' + a + ".txt",'rb') as file:
    data[a] = pickle.load(file)

In [8]:
# Inspect the loaded 1978 transcript
data['1978']

['* sometimes listed as On Location: George Carlin at Phoenix',
 'Performed at the Celebrity Star Theater in Phoenix on July 23, 1978',
 'Hi, this is George Carlin, and I thought we might take a look at some of the pictures from the days when my show business career was just starting. This is one of the earliest photos of my days as an actor. Here I’m playing the part of a baby in an early production of a play called “Hold Onto The Rail.” As proof of the intensity I brought to the role, lying nearby you can see a doll that I had recently strangled. This is a candid photo of my first manager and I having a business conference in the park, where we knew we couldn’t be bugged. In this photo I am trying out a new funny face that I had been working on for about six months. Now, here I am with, uh, two of my fellow actors from the West Harlem production of either Ben Hur or the Sound of Music. You can’t really tell from what we’re wearing there because those are our street clothes. And the p

Let's start with the text-cleaning techniques (pre-processing)

- Convert the whole text to lowercase
- Remove punctuation
- Remove numeric values
- Remove extra whitespace
- Tokenize the text
- Remove stop words

After tokenization:
- Lemmatization
- Part-of-speech tagging
- Bigrams and trigrams
- Typos
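
The first group of steps, plus the bigram step, can be sketched with the standard library alone. This is only a minimal illustration under assumptions: `preprocess`, `ngrams` and the tiny `STOP_WORDS` set are hypothetical names introduced here, not part of the notebook, and a real run would use a fuller stop-word list.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real run would use a fuller list from an NLP library.
STOP_WORDS = {"a", "an", "and", "the", "is", "this", "of", "in", "on", "at", "i"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation and digits, tokenize, drop stop words."""
    text = text.lower()
    # Remove ASCII punctuation plus the curly quotes and ellipsis character.
    text = re.sub("[%s‘’“”…]" % re.escape(string.punctuation), "", text)
    # Remove any word that contains a digit.
    text = re.sub(r"\w*\d\w*", "", text)
    # split() with no arguments tokenizes and collapses repeated whitespace.
    tokens = text.split()
    return [t for t in tokens if t not in stop_words]


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))


sample = "Hi, this is George Carlin, and I thought we might take a look at 1978."
tokens = preprocess(sample)
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams[:3])
```

Whitespace tokenization is the simplest possible choice; lemmatization and part-of-speech tagging need a dedicated NLP library and are not covered by this sketch.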


In [10]:
# Iterate over the dictionary to list its keys (the performance years)
list(iter(data))

['1978', '1983', '1990', '2006', '2024']

In [12]:
# Inspection
# Our dictionary looks like this: key = year of the performance, value = list of text paragraphs
next(iter(data.values()))

['* sometimes listed as On Location: George Carlin at Phoenix',
 'Performed at the Celebrity Star Theater in Phoenix on July 23, 1978',
 'Hi, this is George Carlin, and I thought we might take a look at some of the pictures from the days when my show business career was just starting. This is one of the earliest photos of my days as an actor. Here I’m playing the part of a baby in an early production of a play called “Hold Onto The Rail.” As proof of the intensity I brought to the role, lying nearby you can see a doll that I had recently strangled. This is a candid photo of my first manager and I having a business conference in the park, where we knew we couldn’t be bugged. In this photo I am trying out a new funny face that I had been working on for about six months. Now, here I am with, uh, two of my fellow actors from the West Harlem production of either Ben Hur or the Sound of Music. You can’t really tell from what we’re wearing there because those are our street clothes. And the p

In [15]:
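# Function to turn a list of text paragraphs into a single string
# Note: joining with '' fuses the last word of one paragraph with the first
# word of the next (e.g. "PhoenixPerformed"); ' '.join(list_txt) would avoid that
def combine_txt(list_txt):
  combined_txt = ''.join(list_txt)
  return combined_txt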


In [17]:
data_combined = {key:[combine_txt(value)] for (key,value) in data.items()}

In [43]:
pd.set_option('max_colwidth', 150)

# Transpose so that the keys (years) become the index and the single
# column holds the combined text from the dictionary
data= pd.DataFrame.from_dict(data_combined).transpose()
data.columns = ['transcript']
data = data.sort_index()

data

Unnamed: 0,transcript
1978,"* sometimes listed as On Location: George Carlin at PhoenixPerformed at the Celebrity Star Theater in Phoenix on July 23, 1978Hi, this is George C..."
1983,"Recorded at Carnegie Hall, New York City in 1982, released in 1983.Everybody’s heard the old joke how do you get to Carnegie Hall; practice, man, ..."
1990,"Recorded on January 12–13, 1990, State Theatre, New Brunswick, New JerseySo you want to talk about it? Oh yeah. It all started in 1977. I mean, th..."
2006,"From Life Is Worth Losing\nRecorded on November 5, 2005, Beacon Theater, New York City, New York“It’s called the American dream because you have t..."
2024,"I know I’m a little late with this, but I’d like to get a few licks on this totally bogus topic before it completely disappears from everyone’s co..."


In [44]:
data['transcript']['1978']



In [45]:
import re
import string

def clean_1(text):
  '''First cleaning pass:
  - convert to lowercase
  - remove text in square brackets
  - remove punctuation
  - remove words containing digits
  '''
  text = text.lower()
  # Escape the brackets so the pattern matches "[...]" spans;
  # unescaped, '[.*?]' is a character class that only strips '.', '*' and '?'
  text = re.sub(r'\[.*?\]', '', text)
  text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
  text = re.sub(r'\w*\d\w*', '', text)

  return text

first_clean = lambda x: clean_1(x)

In [46]:
data_clean = pd.DataFrame(data.transcript.apply(first_clean))

In [47]:
data_clean

Unnamed: 0,transcript
1978,sometimes listed as on location george carlin at phoenixperformed at the celebrity star theater in phoenix on july this is george carlin and i ...
1983,recorded at carnegie hall new york city in released in ’s heard the old joke how do you get to carnegie hall practice man practice well like most...
1990,recorded on january – state theatre new brunswick new jerseyso you want to talk about it oh yeah it all started in i mean that’s when i started ...
2006,from life is worth losing\nrecorded on november beacon theater new york city new york“it’s called the american dream because you have to be asle...
2024,i know i’m a little late with this but i’d like to get a few licks on this totally bogus topic before it completely disappears from everyone’s con...


In [48]:
data_clean['transcript']['1978']



In [52]:
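def clean_2(text):
  # Second cleaning pass: remove curly quotes, the ellipsis character and newlines
  # ('...' inside a character class only matches a literal '.', so '…' is used instead)
  text = re.sub('[‘’“”…]', '', text)
  text = re.sub('\n', '', text)
  return text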

second_clean = lambda x: clean_2(x)

In [53]:
data_clean = pd.DataFrame(data_clean.transcript.apply(clean_2))

data_clean

Unnamed: 0,transcript
1978,sometimes listed as on location george carlin at phoenixperformed at the celebrity star theater in phoenix on july this is george carlin and i ...
1983,recorded at carnegie hall new york city in released in s heard the old joke how do you get to carnegie hall practice man practice well like most ...
1990,recorded on january – state theatre new brunswick new jerseyso you want to talk about it oh yeah it all started in i mean thats when i started d...
2006,from life is worth losingrecorded on november beacon theater new york city new yorkits called the american dream because you have to be asleep t...
2024,i know im a little late with this but id like to get a few licks on this totally bogus topic before it completely disappears from everyones consci...


Organizing the data
- corpus (a collection of texts)
- document-term matrix (word counts in matrix format)
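
As a rough illustration of the second item, a document-term matrix can be built with nothing more than `collections.Counter`. The miniature `corpus` below is hypothetical and only stands in for the cleaned transcripts. A vectorizer from a library such as scikit-learn (`CountVectorizer`) is a common alternative in practice, but this sketch does not assume it.

```python
from collections import Counter

# Hypothetical miniature corpus: keys mimic the year index used in the notebook,
# values stand in for already-cleaned transcripts.
corpus = {
    "1978": "people say words and words say people",
    "1990": "language hides the truth about people",
}

# One Counter of word frequencies per document.
counts = {year: Counter(text.split()) for year, text in corpus.items()}

# The vocabulary is the sorted union of all words; it defines the matrix columns.
vocabulary = sorted(set().union(*counts.values()))

# Rows are documents, columns are terms, cells are raw counts
# (a Counter returns 0 for words that never occur in a document).
dtm = {year: [c[word] for word in vocabulary] for year, c in counts.items()}

print(vocabulary)
for year, row in dtm.items():
    print(year, row)
```

Each row can then be compared across years to see how often a given word appears in each performance.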