In [1]:
import pandas as pd

df = pd.read_csv('../data/train.csv')

In [2]:
df.head()

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness
0,0013cc385424,007ACE74B050,"Hi, i'm Isaac, i'm going to be writing about h...",Lead,Adequate
1,9704a709b505,007ACE74B050,"On my perspective, I think that the face is a ...",Position,Adequate
2,c22adee811b6,007ACE74B050,I think that the face is a natural landform be...,Claim,Adequate
3,a10d361e54e4,007ACE74B050,"If life was on Mars, we would know by now. The...",Evidence,Adequate
4,db3e453ec4e2,007ACE74B050,People thought that the face was formed by ali...,Counterclaim,Adequate


In [3]:
display(df.iloc[0, 2])

"Hi, i'm Isaac, i'm going to be writing about how this face on Mars is a natural landform or if there is life on Mars that made it. The story is about how NASA took a picture of Mars and a face was seen on the planet. NASA doesn't know if the landform was created by life on Mars, or if it is just a natural landform. "

In [4]:
df.sample(10)

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness
15359,e828207173c1,0CA9AC7879C9,Reasons why is because the electorial college ...,Evidence,Adequate
16000,43568970fe23,15F80BF4B3C9,"this method is also less complicated, leaving ...",Claim,Effective
3391,fc35886ff045,3D3E4F0B74D2,It is not suprising that the temperature avera...,Evidence,Adequate
16762,536e94ee7f03,21A162D4B0CB,The Electoral College does not give a single v...,Claim,Effective
5311,3ebedeb0e4f6,6120EF6FC0BF,"However, in high school, it is better to give ...",Counterclaim,Effective
11225,64f03b4e19bb,C9AC065BB597,Do you belive in weird landforms?\n\nI believe...,Lead,Adequate
3519,ff34c4a3f81a,3F142A739C54,It's something positive for students to do dur...,Claim,Adequate
34516,354222fad153,5C69070E1E48,Students would benefit from being able to atte...,Claim,Adequate
5686,fa761541f68c,67D58F9FA53C,I also disagree because some students need to ...,Counterclaim,Adequate
12713,1d171d53a13c,E595E218D423,Just imagine how good of a idea this would be,Position,Adequate


In [5]:
display(df.loc[22709, 'discourse_text'])

"It is not a complicated system that people don't understand and works well with our country. Electoral College should stay, many would argue it is not a democratic modern sense but this country is a democracy so of course the system of which we pick our president has to be democratic. "

In [6]:
display(df.loc[7685, 'discourse_text'])

'The majority of Americans have the luxury of owning a car. A car of course, seems like a useful innovation. It gets you places quickly and efficently, and is easier than walking. Yet so many people drive cars that the roads get congested and clogged easily as the cars just idle in the road waiting for the traffic jam to loosen up. Car usage causes the enviroment to decline and an increase in air poulltion, '

In [7]:
df[df['essay_id'] == '8B696D4A84A0']

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness
7685,148fe218d6a7,8B696D4A84A0,The majority of Americans have the luxury of o...,Lead,Effective
7686,c8b39930d9de,8B696D4A84A0,it would be a great idea to lower car usage si...,Position,Effective
7687,150912597748,8B696D4A84A0,An advantage to reducing car usage is a more b...,Claim,Effective
7688,4a83e5d52d55,8B696D4A84A0,"""Passenger cars are responsible for 12 percent...",Evidence,Adequate
7689,188bd2dca358,8B696D4A84A0,Cars also cause poulltion and smog to hang ove...,Evidence,Effective
7690,6d05034fdf90,8B696D4A84A0,With less car usage smog and pollution would d...,Claim,Adequate
7691,faf39d9cf27f,8B696D4A84A0,Driving not only creates an issue with air pol...,Claim,Effective
7692,872e8bde3bf3,8B696D4A84A0,"Cars run on gasoline, a fossil fuel. Which is ...",Evidence,Adequate
7693,d9d7c7c0ee16,8B696D4A84A0,Many people agree with the statment that with...,Evidence,Effective
7694,732f39a42bf7,8B696D4A84A0,Another advantage to reducing car usage is les...,Claim,Adequate


Using the discourse type is necessary: it determines how a discourse element should be read and, in turn, judged. This suggests considering a causal-style architecture in which the predicted scores depend on the discourse type. For this I should refer to the causal paper, which had 4 learning objectives: 1 different, and the other 3 the same.
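
A much simpler baseline than a causal architecture is to put the discourse type directly into the model input, so the classifier can condition on it. The sketch below is an assumption, not something this notebook does: `build_model_inputs` is a hypothetical helper, and both the `"<type>: <text>"` format and the use of the full essay as the second segment are illustrative choices.

```python
from typing import Tuple


def build_model_inputs(discourse_type: str, discourse_text: str, essay_text: str) -> Tuple[str, str]:
    """Return a (text, text_2) pair with the discourse type prepended to the first segment."""
    # Hypothetical format: "<type>: <discourse text>" as the first segment.
    first_segment = f"{discourse_type}: {discourse_text.strip()}"
    # The full essay is passed as the second segment to provide context.
    return first_segment, essay_text.strip()


text, text_2 = build_model_inputs(
    "Claim",
    "I think that the face is a natural landform. ",
    "Full essay text goes here. ",
)
```

The returned pair lines up with the `text` and `text_2` arguments of `prepare_input` defined further down.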

Also, try distilling the model; that is one of the key things. A distilled model would be faster, which could improve our position in the effectiveness ranking too!
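
For reference, one common way to set up distillation is to train the student to match the teacher's temperature-softened class probabilities. The sketch below is framework-free and only illustrates the loss for a single example; `soft_probs`, `distillation_loss`, and the temperature of 2.0 are assumptions, not part of this notebook.

```python
import math
from typing import List


def soft_probs(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float], teacher_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened probabilities, scaled by temperature squared."""
    p_teacher = soft_probs(teacher_logits, temperature)
    p_student = soft_probs(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return kl * temperature ** 2


# Three classes: Ineffective, Adequate, Effective.
same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. In practice this term is usually combined with the ordinary cross-entropy on the true labels.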

In [8]:
import codecs
import os
from typing import Tuple

import numpy as np
import torch
import torch.nn.functional as F
from text_unidecode import unidecode
from tqdm import tqdm

In [9]:
COMP_DIR = '../data/'

In [10]:
def replace_encoding_with_utf8(error: UnicodeError) -> Tuple[bytes, int]:
    return error.object[error.start : error.end].encode("utf-8"), error.end


def replace_decoding_with_cp1252(error: UnicodeError) -> Tuple[str, int]:
    return error.object[error.start : error.end].decode("cp1252"), error.end

codecs.register_error("replace_encoding_with_utf8", replace_encoding_with_utf8)
codecs.register_error("replace_decoding_with_cp1252", replace_decoding_with_cp1252)


def resolve_encodings_and_normalize(text: str) -> str:
    text = (
        text.encode("raw_unicode_escape")
        .decode("utf-8", errors="replace_decoding_with_cp1252")
        .encode("cp1252", errors="replace_encoding_with_utf8")
        .decode("utf-8", errors="replace_decoding_with_cp1252")
    )
    
    text = unidecode(text)
    
    return text


def fetch_essay(essay_id: str, txt_dir: str) -> str:
    essay_path = os.path.join(COMP_DIR, txt_dir, essay_id + '.txt')
    # Use a context manager so the file handle is closed promptly.
    with open(essay_path, 'r') as f:
        return f.read()


def prepare_input(cfg, text, text_2=None):
    inputs = cfg.tokenizer(text, text_2,
                           padding="max_length",
                           add_special_tokens=True,
                           max_length=cfg.max_len,
                           truncation=True)

    for k, v in inputs.items():
        inputs[k] = torch.tensor(v, dtype=torch.long)
        
    return inputs


def inference_fn(test_loader, model, device):
    preds = []
    model.eval()
    model.to(device)
    tk0 = tqdm(test_loader, total=len(test_loader))
    
    for inputs in tk0:
        for k, v in inputs.items():
            inputs[k] = v.to(device)
            
        with torch.no_grad():
            output = model(inputs)
        
        # Softmax over the class dimension (assumed to be the last axis).
        preds.append(F.softmax(output, dim=-1).to('cpu').numpy())

    return np.concatenate(preds)  


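def show_gradient(df, n_row=None):
    # NOTE: `cm` (the colormap) is not defined in the cells shown here;
    # it has to be defined before this helper is called.
    if not n_row:
        n_row = 5

    return (
        df.head(n_row)
        .assign(all_mean=lambda x: x.mean(axis=1))
        .style.background_gradient(cmap=cm, axis=1)
    )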
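
To see what the encode/decode chain in `resolve_encodings_and_normalize` does, the sketch below re-creates it with the standard library only. The `unidecode` step is left out, and the handlers are registered under demo-only names (`demo_encode_utf8`, `demo_decode_cp1252`), which are assumptions made so the example does not interfere with the handlers registered above.

```python
import codecs


def _encode_fallback_utf8(error: UnicodeError):
    # Characters that cp1252 cannot encode are emitted as UTF-8 bytes instead.
    return error.object[error.start:error.end].encode("utf-8"), error.end


def _decode_fallback_cp1252(error: UnicodeError):
    # Bytes that are not valid UTF-8 are decoded as cp1252 instead.
    return error.object[error.start:error.end].decode("cp1252"), error.end


codecs.register_error("demo_encode_utf8", _encode_fallback_utf8)
codecs.register_error("demo_decode_cp1252", _decode_fallback_cp1252)


def fix_encoding(text: str) -> str:
    """Same chain as resolve_encodings_and_normalize, minus the unidecode step."""
    return (
        text.encode("raw_unicode_escape")
        .decode("utf-8", errors="demo_decode_cp1252")
        .encode("cp1252", errors="demo_encode_utf8")
        .decode("utf-8", errors="demo_decode_cp1252")
    )


# U+00C3 U+00A9 is what "e with acute accent" looks like after its two UTF-8
# bytes were each read as a separate single-byte character.
repaired = fix_encoding("caf\u00c3\u00a9")

# A lone non-breaking space is not valid UTF-8 on its own, so it takes the
# cp1252 fallback path and comes back unchanged.
nbsp_kept = fix_encoding("grade book\u00a0that")
```

Here `repaired` is `"caf\u00e9"`, while `nbsp_kept` still contains the non-breaking space; in the notebook it is the subsequent `unidecode` call that maps such characters to plain ASCII.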

In [17]:
df['discourse_text_UPD'] = df['discourse_text'].apply(resolve_encodings_and_normalize)

df['essay_text'] = df['essay_id'].transform(fetch_essay, txt_dir='train')
df['essay_text_UPD'] = df['essay_text'].apply(resolve_encodings_and_normalize)

df['discourse_text_UPD'] = df['discourse_text_UPD'].str.strip()
df['essay_text_UPD'] = df['essay_text_UPD'].str.strip()

df.sample(5)

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness,discourse_text_UPD,essay_text,essay_text_UPD
2761,7a42f20f7a13,31F4831C794D,some people might say that is not true and if ...,Counterclaim,Effective,some people might say that is not true and if ...,Student-designed summer projects would be more...,Student-designed summer projects would be more...
5795,443d67140ccc,6A273D6F0816,"Overall, both students and teachers benefit wh...",Concluding Statement,Effective,"Overall, both students and teachers benefit wh...","It's the last day of school, and students are ...","It's the last day of school, and students are ..."
26552,0358ebd2f2f2,A47BE7BB6F08,The Electoral college is a proccess that was e...,Evidence,Ineffective,The Electoral college is a proccess that was e...,The Electoral college is a proccess that was e...,The Electoral college is a proccess that was e...
1577,90c9df281113,1DEDB886B240,The new technology called the Facial Action Co...,Position,Adequate,The new technology called the Facial Action Co...,The new technology called the Facial Action Co...,The new technology called the Facial Action Co...
14667,15c9fe0f67d1,04DEFB0B11C5,"Under the electoral collage system, voters vot...",Evidence,Ineffective,"Under the electoral collage system, voters vot...",The united states has this ellection system ca...,The united states has this ellection system ca...


In [18]:
def print_before_and_after_UPD(indx, df):
    data = df.loc[indx]
    disc_text = data.discourse_text
    disc_text_upd = data.discourse_text_UPD

    print(f'\nindex: {indx} ===')
    print(f'\n>>> origin text:')
    print(repr(disc_text))
    print(f'\n>>> updated text:')
    print(repr(disc_text_upd))

In [19]:
print_before_and_after_UPD(49, df)


index: 49 ===

>>> origin text:
"Often times throughout middle school, my teachers would give the students each other's assignments to grade themselves. And afterwards, we would see in the grade book\xa0that the grade changed because the students weren't grading the assignments the way they should have been. The student grading was meant to engage the classroom, but it turned into a showcase of how students dont really know how to grade the way teachers do. If students are unaware of how a project is supposed to be graded, they can not successfully design a project that tests whether or not they understand the material. "

>>> updated text:
"Often times throughout middle school, my teachers would give the students each other's assignments to grade themselves. And afterwards, we would see in the grade book that the grade changed because the students weren't grading the assignments the way they should have been. The student grading was meant to engage the classroom, but it turned into a

In [20]:
print_before_and_after_UPD(80, df)


index: 80 ===

>>> origin text:
'President Obama has done nothing to improve our society.\xa0 If anything, he has created the worst society this world has ever seen. \n'

>>> updated text:
'President Obama has done nothing to improve our society.  If anything, he has created the worst society this world has ever seen.'


In [21]:
print_before_and_after_UPD(945, df)


index: 945 ===

>>> origin text:
'The article says, ¨... humans have sent numerous spacecraft to land on this cloud-draped world,¨ and then says ¨... since no spacecraft survived the landing for more than a few hours.¨ If a spacecraft made by people can not survive the landing on Venus numerous times, how can we even come close to surviving the trip? Venus is too dangerous to explore and it is a waste to even send a spacecraft. The spacecrafts cost a good amount of money to make and for them to be wasted on a inhospitable planet is not worth the information we would achieve. \n'

>>> updated text:
'The article says, "... humans have sent numerous spacecraft to land on this cloud-draped world," and then says "... since no spacecraft survived the landing for more than a few hours." If a spacecraft made by people can not survive the landing on Venus numerous times, how can we even come close to surviving the trip? Venus is too dangerous to explore and it is a waste to even send a spacecr

In [23]:
id2label = {
    0: 'Ineffective',
    1: 'Adequate',
    2: 'Effective',
}
label2id = {v: k for k, v in id2label.items()}

label2id

{'Ineffective': 0, 'Adequate': 1, 'Effective': 2}

In [24]:
df['labels'] = df['discourse_effectiveness'].apply(lambda x: label2id[x])
df.head()

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness,discourse_text_UPD,essay_text,essay_text_UPD,labels
0,0013cc385424,007ACE74B050,"Hi, i'm Isaac, i'm going to be writing about h...",Lead,Adequate,"Hi, i'm Isaac, i'm going to be writing about h...","Hi, i'm Isaac, i'm going to be writing about h...","Hi, i'm Isaac, i'm going to be writing about h...",1
1,9704a709b505,007ACE74B050,"On my perspective, I think that the face is a ...",Position,Adequate,"On my perspective, I think that the face is a ...","Hi, i'm Isaac, i'm going to be writing about h...","Hi, i'm Isaac, i'm going to be writing about h...",1
2,c22adee811b6,007ACE74B050,I think that the face is a natural landform be...,Claim,Adequate,I think that the face is a natural landform be...,"Hi, i'm Isaac, i'm going to be writing about h...","Hi, i'm Isaac, i'm going to be writing about h...",1
3,a10d361e54e4,007ACE74B050,"If life was on Mars, we would know by now. The...",Evidence,Adequate,"If life was on Mars, we would know by now. The...","Hi, i'm Isaac, i'm going to be writing about h...","Hi, i'm Isaac, i'm going to be writing about h...",1
4,db3e453ec4e2,007ACE74B050,People thought that the face was formed by ali...,Counterclaim,Adequate,People thought that the face was formed by ali...,"Hi, i'm Isaac, i'm going to be writing about h...","Hi, i'm Isaac, i'm going to be writing about h...",1


In [29]:
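# Fix the seed so the train/test split is reproducible (42 is an arbitrary choice).
shuffled_df = df.sample(frac=1, random_state=42)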

In [30]:
shuffled_df[:2000].to_parquet(COMP_DIR+'test_UPD.parquet')
shuffled_df[2000:].to_parquet(COMP_DIR+'train_UPD.parquet')
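
Note that the split above is made per row, so discourse elements from the same essay can end up in both files. If that overlap is a concern, one alternative is to hold out whole essays. The sketch below uses only the standard library; `pick_test_groups`, the 25% fraction, and the seed are hypothetical choices for illustration.

```python
import random
from typing import Iterable, Set


def pick_test_groups(group_ids: Iterable[str], test_fraction: float = 0.1, seed: int = 42) -> Set[str]:
    """Choose a reproducible random subset of group ids to hold out."""
    unique_ids = sorted(set(group_ids))  # sort first so the shuffle is deterministic
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_fraction))
    return set(unique_ids[:n_test])


# Toy essay ids, one entry per discourse row.
essay_ids = ["A", "A", "B", "C", "C", "C", "D"]
test_groups = pick_test_groups(essay_ids, test_fraction=0.25)

# Row-level mask: every row of a held-out essay goes to the test side.
is_test = [essay_id in test_groups for essay_id in essay_ids]
```

With the DataFrame used here, the same mask could be built with `df['essay_id'].isin(test_groups)` and then used to select the test rows and, negated, the training rows.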

In [26]:
sample_df = df.sample(100)

sample_df[:20]

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness,discourse_text_UPD,essay_text,essay_text_UPD,labels
18724,34231c29f74a,39F8DB13BB4F,I think that driverless cars are a bad idea be...,Concluding Statement,Adequate,I think that driverless cars are a bad idea be...,How safe would you feel if you there was a car...,How safe would you feel if you there was a car...,1
26022,48eae7707264,9DF8A166D993,not only basic people of the USA think the ele...,Claim,Adequate,not only basic people of the USA think the ele...,The Electoral College\n\nDear Mr. Nixon\n\nI t...,The Electoral College\n\nDear Mr. Nixon\n\nI t...,1
5739,2ea574c21e6b,68D87C1576F3,In this article they talk about long ago when ...,Evidence,Adequate,In this article they talk about long ago when ...,The author tells us a lot about Venus in the a...,The author tells us a lot about Venus in the a...,1
4402,74d6929d41c4,4E2B21AF8C1D,because sometimes people do not want show they...,Claim,Adequate,because sometimes people do not want show they...,Why they invent Facial Action Coding System? W...,Why they invent Facial Action Coding System? W...,1
31283,3aa15cc835f0,E74E4865B395,Many schools started online classes due to eme...,Counterclaim,Adequate,Many schools started online classes due to eme...,Many schools throughout the world offer distan...,Many schools throughout the world offer distan...,1
13015,b5c9fcaa868a,EAF6BCA737BD,The mechanical computers could be made to widt...,Evidence,Adequate,The mechanical computers could be made to widt...,"In the article ""The Challenge od Exploring Ven...","In the article ""The Challenge od Exploring Ven...",1
28207,09159e5e79d2,B7E3F460802A,First of all not only are candidates wanting t...,Evidence,Ineffective,First of all not only are candidates wanting t...,Senator i argue that the state shouald change ...,Senator i argue that the state shouald change ...,0
24093,4c50cce4cf72,811DE32FCE53,"Conclusively, the Electoral College is not the...",Concluding Statement,Adequate,"Conclusively, the Electoral College is not the...","Dear State Senator,\n\nI do not like the Elect...","Dear State Senator,\n\nI do not like the Elect...",1
3312,dc3f884db3d3,3BFD98581812,this is however a bad way to get there for the...,Evidence,Ineffective,this is however a bad way to get there for the...,in the artical the autoer shows how they would...,in the artical the autoer shows how they would...,0
12649,679dca5b7ed3,E47590B3F2CF,You also have a lot of free time on the way ba...,Claim,Adequate,You also have a lot of free time on the way ba...,"You should become a Seagoing Cowboy like me, L...","You should become a Seagoing Cowboy like me, L...",1


In [27]:
sample_df[:20].to_parquet('../data/sample_test.parquet')
sample_df[20:].to_parquet('../data/sample_train.parquet')