# Module 3 Implementation

In [34]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 
import re

In [35]:
df = pd.read_csv("Datasets/processed_df.csv", sep=";")

In [36]:
df["post"] = df["post"].astype(str)
df["title"] = df["title"].astype(str)
df["title_post"] = df["title"] + " " + df["post"]

In [37]:
re_subreddits = re.compile("|".join(list(map(lambda x: x.lower(), df.subreddit.unique()))))

re_subreddits 

re.compile(r'artificial|statistics|machinelearning|computervision|rstats|analytics|datasets|computerscience|askstatistics|data|datascience|mlquestions|datasciencejobs|deeplearning|dataengineering|dataanalysis|learnmachinelearning|kaggle|datascienceproject',
           re.UNICODE)

The alternation operator | lets us combine all the subreddit names into a single regular expression.
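
One detail of alternation is worth keeping in mind: alternatives are tried from left to right, so a short name such as data can match before a longer name that starts with it, such as datascience. The sketch below illustrates this with a small, hypothetical subset of names (the list and the sample text are made up for illustration) and shows one possible way to prefer the longest name by escaping each name and sorting by length.

```python
import re

# Hypothetical subset of the subreddit names, for illustration only.
names = ["data", "datascience", "datasets", "statistics"]

# Naive alternation: "data" is tried before "datascience", so it wins.
naive = re.compile("|".join(names))

# Escape each name and put longer names first so the longest match wins.
by_length = sorted(names, key=len, reverse=True)
ordered = re.compile("|".join(re.escape(n) for n in by_length))

text = "i moved from statistics to datascience"
naive_hits = naive.findall(text)
ordered_hits = ordered.findall(text)

print(naive_hits)    # ['statistics', 'data']
print(ordered_hits)  # ['statistics', 'datascience']
```

Whether the longest-match behaviour is preferable depends on how the mentions are used later; the notebook keeps the original ordering.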

In [38]:
def find_subreddit_mentions(text: str):
    # Reuse the pattern compiled in the previous cell instead of rebuilding it on every call.
    return " ".join(re_subreddits.findall(str(text).lower()))


In [39]:
df["subreddit_mentions"] = df["title_post"].apply(find_subreddit_mentions)

The dataset contains many different URL formats, so the pattern also allows the characters that we have seen appear inside URLs.

In [40]:
def url_extracion(text: str):
    # Scheme, dotted host name, then an optional run of path/query/fragment characters.
    re_url = re.compile(r'https?\:\/\/[\w\-]+(?:\.[\w\-]+)+(?:[\/\w\-\.\?\=\&\#]*)?')
    return re_url.findall(str(text))


In [41]:
df["urls"] = df["title_post"].apply(url_extracion)
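
To see what this pattern does and does not capture, the sketch below runs it on a made-up sentence. It suggests two limitations: a trailing period is kept because . is in the allowed character set, and a URL is cut at the first character outside that set, such as a colon. The looser pattern and the punctuation stripping are assumptions offered as one possible alternative, not the approach used in this notebook.

```python
import re

# Same pattern as in the notebook cell above.
re_url = re.compile(r'https?\:\/\/[\w\-]+(?:\.[\w\-]+)+(?:[\/\w\-\.\?\=\&\#]*)?')

# Hypothetical alternative: take everything up to whitespace or a closing
# bracket, then strip trailing sentence punctuation.
re_url_loose = re.compile(r"https?://[^\s<>)\]]+")

# Made-up sample text, for illustration only.
text = ("Docs: https://www.bls.gov/help/hlpforma.htm#NB. "
        "Post: https://www.linkedin.com/feed/update/urn:li:activity")

original_hits = re_url.findall(text)
loose_hits = [u.rstrip(".,;:!?") for u in re_url_loose.findall(text)]

print(original_hits)
print(loose_hits)
```

With this sample, the original pattern keeps the period after NB and stops the second URL at urn, while the looser variant returns both URLs whole.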

Phone numbers are written in several different ways in the dataset, so we define a general regular expression that allows an optional country prefix and several possible separators.

In [42]:
def phone_number_extracion(text: str):
    # Optional +prefix (1 to 3 digits), then 3-3-4 digits with optional space or hyphen separators.
    re_phone_number = re.compile(r'(?:\+\d{1,3}[\s\-]?)?[0-9]{3}[\s\-]?\d{3}[\s\-]?\d{4}')
    return re_phone_number.findall(str(text))

In [43]:
df["phone_numbers"] = df["post"].apply(phone_number_extracion)

In [44]:
# Keep only the rows where at least one phone number was found.
df[df.phone_numbers.str.len() > 0]

Unnamed: 0,created_date,subreddit,title,author,full_link,score,post,sentiment,lemmatized_post,stemmed_post,clean_post,clean_title,title_post,subreddit_mentions,urls,phone_numbers
28,2009-10-28 20:28:27,statistics,Ask Stats: Good Introductory Book (or websites...,ST2K,https://www.reddit.com/r/statistics/comments/9...,9,I own a copy of [Bayesian Statistics: An Intro...,0,copy bayesian statistic introduction little bi...,copi bayesian statist introduct littl bit diff...,copy bayesian statistics introduction little b...,ask stats good introductory book website bayes...,Ask Stats: Good Introductory Book (or websites...,statistics statistics statistics statistics,[http://www.amazon.com/Bayesian-Statistics-Int...,[0340814055]
202,2010-09-05 05:04:59,artificial,Scientific study proving basically the exact t...,ithkuil,https://www.reddit.com/r/artificial/comments/d...,0,http://www.sciencedaily.com/releases/2010/09/1...,0,gt ancestral structure likely group densely pa...,gt ancestr structur like group dens pack cell ...,gt ancestral structure likely group densely pa...,scientific study proving basically exact thing...,Scientific study proving basically the exact t...,artificial,[http://www.sciencedaily.com/releases/2010/09/...,[1009021210]
264,2010-10-25 22:31:18,statistics,"Probability of the game ""Set""... please help",NaLaurethSulfate,https://www.reddit.com/r/statistics/comments/d...,2,"So I am not very good at statistics, have take...",0,good statistic taken college poor high school ...,good statist taken colleg poor high school cov...,good statistics taken college poor high school...,probability game set please help,"Probability of the game ""Set""... please help S...",statistics,[http://www.setgame.com/set/index.html],"[2658227848, 0632911392, 1265822784, 253164556..."
354,2011-01-04 08:30:23,MachineLearning,Ask ML: Document ranking with user ratings?,eggbrain,https://www.reddit.com/r/MachineLearning/comme...,10,I've had a fun idea for awhile (not for profit...,0,fun idea awhile profit entertainment keep runn...,fun idea awhil profit entertain keep run barri...,fun idea awhile profit entertainment keep runn...,ask ml document ranking user rating,Ask ML: Document ranking with user ratings? I'...,data data,[http://en.wikipedia.org/wiki/Learning_to_rank...,[9780596529]
359,2011-01-06 04:51:18,analytics,"DotCed - Functional Web Analytics - Tagging, R...",dotced,https://www.reddit.com/r/analytics/comments/ew...,1,"DotCed,a Functional Analytics Consultant, offe...",0,dotced functional analytics consultant offerin...,dotc function analyt consult offer googl analy...,dotced functional analytics consultant offerin...,dotced functional web analytics tagging report...,"DotCed - Functional Web Analytics - Tagging, R...",analytics analytics analytics,[],[919-404-9233]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
273985,2022-05-07 06:59:25,dataengineering,Consolidating many data tables from BLS’s API ...,bongdong42O,https://www.reddit.com/r/dataengineering/comme...,1,"Hey, I’m having trouble figuring out a way to ...",0,hey trouble figuring way get ton get requested...,hey troubl figur way get ton get request info ...,hey trouble figuring way get ton get requested...,consolidating many data table bls api data fac...,Consolidating many data tables from BLS’s API ...,data data statistics data data data data data,"[https://www.bls.gov/help/hlpforma.htm#NB, htt...","[2053000000, 0000019007, 2053000000, 0000033030]"
274065,2022-05-07 21:02:36,computervision,Morphological Operators usage,ErIndi,https://www.reddit.com/r/computervision/commen...,1,Hello everyone!\n\n&amp;#x200B;\n\nI have post...,0,hello everyone amp posted past project working...,hello everyon amp post past project work curre...,hello everyone amp posted past project working...,morphological operator usage,Morphological Operators usage Hello everyone!\...,computervision,[https://www.reddit.com/r/computervision/comme...,"[7196596905, 1333630640]"
274133,2022-05-08 07:29:03,computerscience,Python Programming Character Pairs please help!,Stockolorian,https://www.reddit.com/r/computerscience/comme...,1,Does anyone know how to read a text file in py...,0,anyone know read text file python count many c...,anyon know read text file python count mani ch...,anyone know read text file python count many c...,python programming character pair please help,Python Programming Character Pairs please help...,,[https://preview.redd.it/fgqkz4sej6y81.png?wid...,[9734740847]
274139,2022-05-08 08:51:33,datasets,"[self-promotion] Hey all, we are running a dat...",Kobedoggg,https://www.reddit.com/r/datasets/comments/ukv...,1,Check out our expressions of interest [LINK](h...,1,check expression interest link li activity lea...,check express interest link li activ learn amp...,check expressions interest link li activity le...,self promotion hey running data challenge vari...,"[self-promotion] Hey all, we are running a dat...",data datasets datasets,[https://www.linkedin.com/feed/update/urn],[6927760517]
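
Several of the matches above are plain ten-digit runs (for example 0340814055) rather than formatted phone numbers, because every separator in the pattern is optional. The sketch below compares the notebook's pattern with a stricter variant that requires separators and forbids adjacent digits. The stricter pattern and the sample text are assumptions for illustration; the stricter pattern would also miss phone numbers written without separators.

```python
import re

# Same pattern as in the notebook cell above.
re_phone = re.compile(r'(?:\+\d{1,3}[\s\-]?)?[0-9]{3}[\s\-]?\d{3}[\s\-]?\d{4}')

# Hypothetical stricter variant: separators are mandatory and the match
# may not be glued to other digits on either side.
re_phone_strict = re.compile(
    r'(?<!\d)(?:\+\d{1,3}[\s\-]?)?\d{3}[\s\-]\d{3}[\s\-]\d{4}(?!\d)'
)

# Made-up sample text, for illustration only.
text = "ISBN 0340814055, id 26582278481, call 919-404-9233 or +1 919 404 9233"

original_hits = re_phone.findall(text)
strict_hits = re_phone_strict.findall(text)

print(original_hits)  # ['0340814055', '2658227848', '919-404-9233', '+1 919 404 9233']
print(strict_hits)    # ['919-404-9233', '+1 919 404 9233']
```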


Code is very hard to detect. We considered capturing function definitions with a regular expression, but then we could not extract the function body, because nothing indicates where the function ends. However, we have seen that in many posts the code is placed between triple backticks.

In [45]:
def code_extraction(text:str):
    re_code = re.compile(r'```(.*?)```')
    return re_code.findall(str(text))

In [46]:
df["title_post_code"] = df["title_post"].apply(code_extraction)

In [47]:
df["title_post_code"].sum()

['a_ij',
 'pi',
 'f_i(t)',
 'f_i(t)',
 'a_ij',
 'pi',
 'f_i(t)',
 'f_i(t)',
 'x_i',
 'f(x|x_i)',
 'f(x|x_i)',
 'f(x|x_i)',
 'x_i',
 '--save_resume',
 'pairs(iris)',
 'lm()',
 'aov()',
 'waiting = 80',
 'waiting = 80',
 'dplyr',
 'hflights',
 'UniqueCarrier',
 'UniqueCarrier',
 'UniqueCarrier',
 'lut',
 '[ ]',
 'UniqueCarrier',
 'df',
 'if you are a PC',
 '(message body)',
 'aov()',
 'anova()',
 'fitnet',
 'fitnet',
 'hiddenSizes',
 'trainFcn',
 'airport_1,airport_2,flight_volume',
 "JFK,O'Hare,1015",
 'INPUT -&gt; [CONV -&gt; RELU -&gt; CONV -&gt; RELU -&gt; POOL]*3 -&gt; [FC -&gt; RELU]*2 -&gt; FC',
 'description, size',
 'hey this is a 15 1/2 inch item with measurementB of 12 inches, 12 1/2',
 'description, size',
 'measurementB of 12, 12',
 '12 measurementB, 12',
 'max_threshold',
 'min_threshold',
 'anchors = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828]',
 '',
 '',
 '',
 '',
 '',
 '',
 ' Label [ top-left x and y coordinates, bottom-rig

Some code has indeed been captured, but the approach is not very accurate.
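
One plausible contributor to this is that . does not match newlines by default, so a fenced block spanning several lines is never captured by the pattern above. The sketch below compares the pattern with and without re.DOTALL on a made-up post. The sample text is hypothetical, and the fence is built with string multiplication only to keep the example easy to embed.

```python
import re

FENCE = "`" * 3  # three backticks

# Hypothetical post containing a multi-line fenced code block.
post = f"Try this:\n{FENCE}\nimport pandas as pd\ndf.head()\n{FENCE}\nThanks!"

# Without DOTALL, "." stops at the first newline, so nothing is captured.
single_line = re.compile(FENCE + r"(.*?)" + FENCE)

# With DOTALL, "." also matches newlines and the whole block is captured.
multi_line = re.compile(FENCE + r"(.*?)" + FENCE, re.DOTALL)

single_hits = single_line.findall(post)
multi_hits = [block.strip() for block in multi_line.findall(post)]

print(single_hits)  # []
print(multi_hits)   # ['import pandas as pd\ndf.head()']
```

This would not help with code that is posted without any fence, which remains hard to detect.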