## Feature Extraction

Corpus source: [Creating a QA corpus from AskReddit](https://www.kaggle.com/mauroebordon/creating-a-qa-corpus-from-askreddit/)

For now, only tokens, lemmas, PoS tags, and stopword filtering are extracted.

In [5]:
import pandas as pd
import re
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
stop = stopwords.words('english')


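def extract_features(df: pd.DataFrame, toks=True, lems=True, pos=True):
    """
    Return a copy of df, restricted to short questions, with stopword-filtered
    text (Qless, Aless) and, optionally, tokens (toks), lemmas (lems),
    and (token, POS tag) pairs (pos) for questions and answers.
    """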
    lmtzr = WordNetLemmatizer()
   
    # Keep only questions shorter than 50 characters
    db_df = df[df.Q.apply(lambda x: len(str(x)) < 50)].copy()

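    # Remove stopwords (case-sensitive, so capitalized words such as "What" are kept)
    pattern = re.compile(r'\b(' + r'|'.join(map(re.escape, stop)) + r')\b\s*')
    # regex=True is passed explicitly: pandas 2.0+ defaults to regex=False,
    # which rejects compiled patterns
    db_df["Qless"] = db_df.Q.str.replace(pattern, '', regex=True)
    db_df["Aless"] = db_df.ANS.str.replace(pattern, '', regex=True)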
        
    # Tokenize the stopword-filtered text
    if toks:
        db_df["Qtoks"] = [word_tokenize(w) for w in db_df["Qless"]]
        db_df["Atoks"] = [word_tokenize(w) for w in db_df["Aless"]]

    # Lemmatize; tokenize here so this step does not depend on the toks flag
    if lems:
        db_df["Qlemmas"] = [' '.join(lmtzr.lemmatize(t) for t in word_tokenize(q)) for q in db_df["Qless"]]
        db_df["Alemmas"] = [' '.join(lmtzr.lemmatize(t) for t in word_tokenize(a)) for a in db_df["Aless"]]


    # (token, POS tag) pairs
    if pos:
        db_df["Qpos"] = [pos_tag(word_tokenize(w)) for w in db_df["Qless"]]
        db_df["Apos"] = [pos_tag(word_tokenize(w)) for w in db_df["Aless"]]

    
    
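    # Reorder columns, keeping only those that were actually computed
    cols = ["id", "Q", "Qscore", "Qless", "Qlemmas", "Qpos", "Qtoks", "ANS", "ANSscore", "Aless", "Alemmas", "Apos", "Atoks"]
    db_df = db_df[[c for c in cols if c in db_df.columns]]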

    return db_df
    

db_df = pd.read_csv("ask-reddit-corpus.csv", index_col=0)


# Note: using pickle does not help with compression at all
df = extract_features(db_df)
df

Unnamed: 0,id,Q,Qscore,Qless,Qlemmas,Qpos,Qtoks,ANS,ANSscore,Aless,Alemmas,Apos,Atoks
4,100b8y,"First time drinking, what do I do?",7,"First time drinking, I ?","First time drinking , I ?","[(First, JJ), (time, NN), (drinking, NN), (,, ...","[First, time, drinking, ,, I, ?]","try not mix drinks.., if you start with spirit...",12.0,"try mix drinks.., start spirits finish spirits..","try mix drink .. , start spirit finish spirit ..","[(try, VB), (mix, JJ), (drinks, NNS), (.., VBP...","[try, mix, drinks, .., ,, start, spirits, fini..."
10,100nsq,Should I or should I not upgrade to Windows 8?,5,Should I I upgrade Windows 8?,Should I I upgrade Windows 8 ?,"[(Should, MD), (I, PRP), (I, PRP), (upgrade, V...","[Should, I, I, upgrade, Windows, 8, ?]",how about dualboot?,6.0,dualboot?,dualboot ?,"[(dualboot, NN), (?, .)]","[dualboot, ?]"
38,102xlc,What makes you smile?,123,What makes smile?,What make smile ?,"[(What, WP), (makes, VBZ), (smile, NN), (?, .)]","[What, makes, smile, ?]",zygomatic major and orbicularis oculi,172.0,zygomatic major orbicularis oculi,zygomatic major orbicularis oculus,"[(zygomatic, JJ), (major, JJ), (orbicularis, N...","[zygomatic, major, orbicularis, oculi]"
46,103c5q,What was your ego boost today?,5,What ego boost today?,What ego boost today ?,"[(What, WP), (ego, VBZ), (boost, NN), (today, ...","[What, ego, boost, today, ?]",Yesterday I was told that I look like Emma Sto...,5.0,Yesterday I told I look like Emma Stone. I fee...,Yesterday I told I look like Emma Stone . I fe...,"[(Yesterday, NN), (I, PRP), (told, VBD), (I, P...","[Yesterday, I, told, I, look, like, Emma, Ston..."
47,103hv3,What is the worst way to die you can think of?,199,What worst way die think ?,What worst way die think ?,"[(What, WP), (worst, JJS), (way, NN), (die, NN...","[What, worst, way, die, think, ?]",there was a video from a few years ago of a ch...,201.0,video years ago chechnian guy getting throat c...,video year ago chechnian guy getting throat cu...,"[(video, CD), (years, NNS), (ago, RB), (chechn...","[video, years, ago, chechnian, guy, getting, t..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
410527,ozgvlf,How did you get rid of depression?,1,How get rid depression?,How get rid depression ?,"[(How, WRB), (get, VB), (rid, JJ), (depression...","[How, get, rid, depression, ?]","I can't, I can only reduce my sadness by worki...",1.0,"I ', I reduce sadness working something I like","I ' , I reduce sadness working something I like","[(I, PRP), (', ''), (,, ,), (I, PRP), (reduce,...","[I, ', ,, I, reduce, sadness, working, somethi..."
410528,ozgw90,What is your kink?,1,What kink?,What kink ?,"[(What, WP), (kink, VB), (?, .)]","[What, kink, ?]",Snuff films.,1.0,Snuff films.,Snuff film .,"[(Snuff, NNP), (films, NNS), (., .)]","[Snuff, films, .]"
410532,ozgwry,What time do you wake up and go to sleep?,1,What time wake go sleep?,What time wake go sleep ?,"[(What, WP), (time, NN), (wake, NN), (go, VB),...","[What, time, wake, go, sleep, ?]","Wake up in 5am, sleep in 8pm",1.0,"Wake 5am, sleep 8pm","Wake 5am , sleep 8pm","[(Wake, VB), (5am, CD), (,, ,), (sleep, VBP), ...","[Wake, 5am, ,, sleep, 8pm]"
410534,ozgww8,What's the coolest thing about this generation?,1,What'coolest thing generation?,What'coolest thing generation ?,"[(What'coolest, JJS), (thing, NN), (generation...","[What'coolest, thing, generation, ?]",Me and you\nYour momma and your cousin too,1.0,Me Your momma cousin,Me Your momma cousin,"[(Me, NNP), (Your, NNP), (momma, NN), (cousin,...","[Me, Your, momma, cousin]"
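
Two artifacts are visible in the `Qless` column above: capitalized stopwords such as `What` and `Should` survive, and `What's the coolest` collapses into `What'coolest`. The sketch below reproduces both with the same regex construction as `extract_features`, then shows one possible alternative that filters at token level with a lowercase comparison. The stopword subset and the `simple_tokenize` helper are hypothetical stand-ins chosen for illustration (the notebook uses the full NLTK list and `word_tokenize`), so this is a minimal sketch rather than a drop-in replacement.

```python
import re

# Hypothetical subset of the NLTK English stopword list, for illustration only.
stop_subset = ["what", "s", "the", "about", "this", "i", "should", "or", "not", "to"]

# Same construction as in extract_features: case-sensitive, removes trailing whitespace.
pattern = re.compile(r'\b(' + '|'.join(stop_subset) + r')\b\s*')


def regex_filter(text: str) -> str:
    """Remove stopwords with the case-sensitive regex used in the notebook."""
    return pattern.sub('', text)


def simple_tokenize(text: str) -> list:
    """Hypothetical tokenizer: keeps contractions like "What's" as one token."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


def token_filter(text: str) -> list:
    """Alternative: tokenize first, then drop tokens whose lowercase form is a stopword."""
    stop_set = set(stop_subset)
    return [tok for tok in simple_tokenize(text) if tok.lower() not in stop_set]


q1 = "What's the coolest thing about this generation?"
q2 = "Should I or should I not upgrade to Windows 8?"

print(regex_filter(q1))   # What'coolest thing generation?
print(regex_filter(q2))   # Should I I upgrade Windows 8?
print(token_filter(q1))   # ["What's", 'coolest', 'thing', 'generation', '?']
print(token_filter(q2))   # ['upgrade', 'Windows', '8', '?']
```

The token-level variant removes `Should` and `I` regardless of case and leaves contractions intact, at the cost of changing what the downstream `Qtoks` and `Qpos` columns contain. Whether that trade-off is desirable depends on how the features are used later.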


In [6]:
df.to_csv("features.csv")
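
One thing to keep in mind when saving to CSV: list-valued columns such as `Qtoks` and `Qpos` are written as their string representation, so reading `features.csv` back yields strings rather than lists. The sketch below shows one way to restore them with `ast.literal_eval` through the `converters` argument of `pd.read_csv`. It uses a one-row toy frame and an in-memory buffer instead of the real file; both are assumptions made to keep the example self-contained.

```python
import ast
import io

import pandas as pd

# Toy frame mimicking two list-valued columns of the features table.
toy = pd.DataFrame({
    "id": ["102xlc"],
    "Qtoks": [["What", "makes", "smile", "?"]],
    "Qpos": [[("What", "WP"), ("makes", "VBZ"), ("smile", "NN"), ("?", ".")]],
})

# Write to an in-memory buffer instead of features.csv.
csv_text = toy.to_csv()

# Plain read: the list columns come back as strings.
raw = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(type(raw.loc[0, "Qtoks"]))  # <class 'str'>

# Read with converters: each cell is parsed back into a Python list.
list_cols = ["Qtoks", "Qpos"]
restored = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    converters={col: ast.literal_eval for col in list_cols},
)
print(restored.loc[0, "Qtoks"])  # ['What', 'makes', 'smile', '?']
```

`ast.literal_eval` only parses Python literals, so it is a safer choice than `eval` for this purpose.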