# Compare delta datasets

Do the delta datasets differ with respect to our lexical scores?

In [1]:
import pandas as pd
from utils import build_source_df, scale_rows

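# Scores to include in the comparison.
scores_to_display = [
    "compression_ratio", "pos_compression_ratio", "stopword_density",
    "number_of_tokens", "self_bleu", "lix_score",
]

# Rebuild the source data (presumably written to lex_lix_source.csv), then load it.
build_source_df(overwrite=True)
source_df = pd.read_csv("lex_lix_source.csv")
# Keep only the selected scores, then rescale each row for display.
source_df = source_df[source_df.score.isin(scores_to_display)]
source_df = source_df.apply(scale_rows, axis=1)

source_df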

| Unnamed: 0 | delta | language | score | value |
|---|---|---|---|---|
| 2 | all | both | number_of_tokens/100 | 1.515254 |
| 3 | all | both | stopword_density | 0.519447 |
| 4 | all | both | self_bleu | 0.193784 |
| 5 | all | both | compression_ratio | 1.531424 |
| 6 | all | both | pos_compression_ratio | 3.009525 |
| ... | ... | ... | ... | ... |
| 259 | nonfiction | nno | lix_score/10 | 3.324232 |
| 260 | translated | nno | lix_score/10 | 1.828231 |
| 261 | fiction | nno | lix_score/10 | 2.643742 |
| 262 | newspapers | nno | lix_score/10 | 3.820227 |
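
The score names in the output (`number_of_tokens/100`, `lix_score/10`) suggest that `scale_rows` divides the large-valued scores so that all bars fit on one axis. The actual helper lives in `utils` and is not shown here, so the following is only a sketch of what such a function might look like. The name `scale_rows_sketch`, the `SCALE_FACTORS` mapping and the demo values are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical divisors, inferred only from the score names in the output.
SCALE_FACTORS = {"number_of_tokens": 100, "lix_score": 10}


def scale_rows_sketch(row: pd.Series) -> pd.Series:
    """Divide large-valued scores and record the divisor in the score name."""
    factor = SCALE_FACTORS.get(row["score"])
    if factor is None:
        return row  # scores without a divisor are left untouched
    row = row.copy()
    row["value"] = row["value"] / factor
    row["score"] = f"{row['score']}/{factor}"
    return row


# Made-up demo rows, not values from lex_lix_source.csv.
demo = pd.DataFrame({
    "delta": ["all", "all"],
    "language": ["both", "both"],
    "score": ["number_of_tokens", "stopword_density"],
    "value": [250.0, 0.5],
})
scaled = demo.apply(scale_rows_sketch, axis=1)
print(scaled)
```

Putting the divisor into the score name keeps the rescaling visible on the plot's x-axis, which matters when scores with different units share one y-axis.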


In [2]:
from dash import Dash, html, dcc, callback, Output, Input
import plotly.express as px

df = source_df

app = Dash()

# Page layout: a title, a language selector and the bar chart.
app.layout = [
    html.H1(children='Lexical scores across delta datasets', style={'textAlign': 'center', 'color': 'White'}),
    dcc.Dropdown(df.language.unique(), 'both', id='dropdown-language'),
    dcc.Graph(id='graph-content')
]


# Redraw the chart whenever a different language is selected.
@callback(
    Output('graph-content', 'figure'),
    Input('dropdown-language', 'value')
)
def update_graph(selected_language):
    # One group of bars per score, one bar per delta dataset.
    dff = df[df.language == selected_language]
    return px.bar(dff, x='score', y='value', color='delta', barmode='group')


if __name__ == '__main__':
    app.run(debug=True)
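
The numbers behind the grouped bar chart can also be read off without starting the Dash server, by pivoting the long-format frame into one row per score and one column per delta. The sketch below assumes the same `delta`, `language`, `score` and `value` columns as `source_df`. The helper name `score_table` and the toy values are made up for illustration and are not taken from the data.

```python
import pandas as pd

# Illustrative values only (not taken from lex_lix_source.csv).
toy = pd.DataFrame({
    "delta": ["fiction", "fiction", "newspapers", "newspapers"],
    "language": ["both"] * 4,
    "score": ["self_bleu", "stopword_density"] * 2,
    "value": [0.10, 0.50, 0.08, 0.52],
})


def score_table(df: pd.DataFrame, language: str) -> pd.DataFrame:
    """Return a score-by-delta table for a single language."""
    subset = df[df["language"] == language]
    # pivot raises if a (score, delta) pair occurs more than once.
    return subset.pivot(index="score", columns="delta", values="value")


table = score_table(toy, "both")
print(table)
```

With the real data this would be called as `score_table(source_df, "both")`, which gives a static view of the same comparison the dropdown selects.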


## Conclusion: Yes, there is variation in lexical scores between the delta datasets

- Texts in the factual and newspapers deltas are much shorter than those in the other deltas, which affects the linguistic diversity scores.
- Stopword density is fairly similar across the delta datasets. Self-BLEU scores range from 0.05 (nonfiction) to 0.13 (translated).
- Compression ratio is similar across deltas (about 1.6x), except for the short-text outliers factual and newspapers.
- The part-of-speech compression ratio appears to track the delta text lengths closely.
- The translated dataset has the lowest LIX score (easiest to read) and the nonfiction dataset has the highest (hardest to read).
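
For readers unfamiliar with two of the scores above, here is a minimal sketch of how a LIX score and a compression ratio could be computed for a single text. It is not the implementation used to produce `lex_lix_source.csv`. The tokenization, the sentence splitting on `.`, `!` and `?`, and the choice of `zlib` as compressor are assumptions. The LIX formula itself is the standard one: average sentence length plus the percentage of words longer than six letters.

```python
import re
import zlib


def lix(text: str) -> float:
    """LIX = words / sentences + 100 * long_words / words (long: > 6 letters)."""
    words = re.findall(r"\w+", text)
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


def compression_ratio(text: str) -> float:
    """Size of the raw UTF-8 bytes divided by the size after zlib compression."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


print(lix("A cat sat. A dog ran."))
print(compression_ratio("abc " * 200))
```

Short texts leave the compressor little repetition to exploit, and the fixed overhead of the compressed format weighs more heavily on them, which is consistent with the short deltas standing out as compression-ratio outliers.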