<a href="https://colab.research.google.com/github/Alexwcjung/Spring2024/blob/main/Corpus/Lexical-Diversity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🌿 Topics:

## 1. Type vs. Token
## 2. **Lexical Diversity measures (10 types)**

## Getting LD indices

+ **TTR (Type-Token Ratio)**: Measures the ratio of the number of unique words (types) to the total number of words (tokens) in a text, indicating vocabulary diversity. *Reference: (Richards, 1987)*

+ **RTTR (Root Type-Token Ratio)**: An adaptation of TTR that divides the number of types by the square root of the number of tokens to reduce the impact of text length. *Reference: (Guiraud, 1954)*

+ **LogTTR (Logarithmic Type-Token Ratio)**: A measure that involves taking the logarithm of the number of types and dividing it by the logarithm of the number of tokens. *Reference: (Herdan, 1960)*

+ **MassTTR (Maas index)**: Computed by `ld.maas_ttr` as (log tokens - log types) / (log tokens)². Unlike the other indices, lower values indicate greater lexical diversity. *Reference: (Maas, 1972)*

+ **MSTTR (Mean Segmental Type-Token Ratio)**: Calculates TTR for multiple segments of a given length within the text and averages the results. *Reference: (Johnson, 1944)*

+ **MATTR (Moving-Average Type-Token Ratio)**: A measure that calculates TTR over a moving window, averaging the TTR values across the text. *Reference: (Covington & McFall, 2010)*

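+ **HDD (Hypergeometric Distribution D)**: For each type, uses the hypergeometric distribution to compute the probability that the type appears at least once in a random sample of fixed size drawn from the text, then combines these probabilities into a single index. *Reference: (McCarthy & Jarvis, 2007)*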

+ **MTLD (Measure of Textual Lexical Diversity)**: Measures lexical diversity as the mean length of sequential word strings in the text that maintain a TTR above a set threshold. *Reference: (McCarthy, 2005)*

+ **MTLD_wrap**: A variant of MTLD that wraps around to the beginning of the text when it reaches the end, for a more comprehensive analysis. *Reference: (McCarthy, 2005)*

+ **MTLD_bid (Bi-directional MTLD)**: An extension of MTLD that calculates lexical diversity in both forward and reverse directions in the text. *Reference: (McCarthy, 2005)*
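
As a rough illustration of the simpler ratio-based measures in the list above, the sketch below counts types and tokens in a toy sentence and computes TTR, RTTR, LogTTR, and the Maas index by hand. The sentence, the whitespace tokenizer, and the variable names are assumptions made for illustration; the `lexical-diversity` functions used later in this notebook may differ in details such as lemmatization.

```python
import math

# Toy text (hypothetical); split on whitespace for simplicity.
tokens = "the cat sat on the mat the end".split()
n_tokens = len(tokens)       # every running word counts as a token
n_types = len(set(tokens))   # each distinct word counts once as a type

ttr = n_types / n_tokens                          # type-token ratio
rttr = n_types / math.sqrt(n_tokens)              # root TTR
log_ttr = math.log(n_types) / math.log(n_tokens)  # log TTR
# Maas index; base-10 logarithms are an assumption of this sketch.
maas = (math.log10(n_tokens) - math.log10(n_types)) / math.log10(n_tokens) ** 2

print(n_tokens, n_types)  # 8 6
print(round(ttr, 3), round(rttr, 3), round(log_ttr, 3), round(maas, 3))
```

Note that the first three values grow with diversity, while the Maas value shrinks as diversity grows.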


## Read data file from GitHub; add string length and number of split words (N_Splits)

+ data from https://github.com/MK316/Workingpapers/blob/main/Analysis/ksatdata_12only.csv
+ Raw data: https://raw.githubusercontent.com/MK316/Workingpapers/main/Analysis/ksatdata_12only.csv (12 only, Q41 removed)

In [1]:
import pandas as pd

In [2]:
url = 'https://raw.githubusercontent.com/MK316/Workingpapers/main/Analysis/mydata01.csv'
df1 = pd.read_csv(url)

In [3]:
# df1.to_csv('/content/mydata.csv', sep=',', na_rep='NaN')
# import chardet
# with open('/content/ksatdata_item17.csv', 'rb') as f:
#   enc = chardet.detect(f.read())
# df1 = pd.read_csv('/content/ksatdata_item17.csv', encoding=enc['encoding'])

In [4]:
df1

Unnamed: 0,Year,Category,QN,Passage
0,2015,Context,Q18,One difference between winners and losers is h...
1,2015,Context,Q19,"As I walked to the train station, I felt the w..."
2,2015,Context,Q20,Many disciplines are better learned by enterin...
3,2015,Context,Q21,The most normal and competent child encounters...
4,2015,Context,Q22,The most normal and competent child encounters...
...,...,...,...,...
131,2022,Infer-Logic,Q36,"According to the market response model, it is ..."
132,2022,Infer-Logic,Q37,In spite of the likeness between the fictional...
133,2022,Infer-Logic,Q38,Retraining current employees for new positions...
134,2022,Infer-Logic,Q39,As long as the irrealism of the silent black a...


# Adding a column with length info

In [5]:
df2 = df1.copy()  # copy, so that new columns are not also added to df1

In [None]:
# Added column: String length (number of characters in each passage)
df2['String'] = df2['Passage'].str.len()
df2

Unnamed: 0,Year,Category,QN,Passage,String
0,2015,Context,Q18,One difference between winners and losers is h...,635
1,2015,Context,Q19,"As I walked to the train station, I felt the w...",628
2,2015,Context,Q20,Many disciplines are better learned by enterin...,715
3,2015,Context,Q21,The most normal and competent child encounters...,737
4,2015,Context,Q22,The most normal and competent child encounters...,724
...,...,...,...,...,...
131,2022,Infer-Logic,Q36,"According to the market response model, it is ...",1035
132,2022,Infer-Logic,Q37,In spite of the likeness between the fictional...,1025
133,2022,Infer-Logic,Q38,Retraining current employees for new positions...,981
134,2022,Infer-Logic,Q39,As long as the irrealism of the silent black a...,1068


In [None]:
# Added columns: split words (Splits) and number of split words (N_Splits)
df2['Splits'] = df2['Passage'].str.split()   # whitespace-separated words
df2['N_Splits'] = df2['Splits'].str.len()    # number of words per passage
df2

Unnamed: 0,Year,Category,QN,Passage,String,Splits,N_Splits
0,2015,Context,Q18,One difference between winners and losers is h...,635,"[One, difference, between, winners, and, loser...",107
1,2015,Context,Q19,"As I walked to the train station, I felt the w...",628,"[As, I, walked, to, the, train, station,, I, f...",123
2,2015,Context,Q20,Many disciplines are better learned by enterin...,715,"[Many, disciplines, are, better, learned, by, ...",117
3,2015,Context,Q21,The most normal and competent child encounters...,737,"[The, most, normal, and, competent, child, enc...",128
4,2015,Context,Q22,The most normal and competent child encounters...,724,"[The, most, normal, and, competent, child, enc...",128
...,...,...,...,...,...,...,...
131,2022,Infer-Logic,Q36,"According to the market response model, it is ...",1035,"[According, to, the, market, response, model,,...",163
132,2022,Infer-Logic,Q37,In spite of the likeness between the fictional...,1025,"[In, spite, of, the, likeness, between, the, f...",167
133,2022,Infer-Logic,Q38,Retraining current employees for new positions...,981,"[Retraining, current, employees, for, new, pos...",155
134,2022,Infer-Logic,Q39,As long as the irrealism of the silent black a...,1068,"[As, long, as, the, irrealism, of, the, silent...",174


# Lexical Diversity Measures

In [6]:
!pip install lexical-diversity
from lexical_diversity import lex_div as ld

Collecting lexical-diversity
  Downloading lexical_diversity-0.1.1-py3-none-any.whl (117 kB)
Installing collected packages: lexical-diversity
Successfully installed lexical-diversity-0.1.1


In [None]:
# Added column: Lemma (lemmatized tokens returned by ld.flemmatize)
lem = []

for i in range(0, len(df2['Passage'])):
  LEM = ld.flemmatize(df2['Passage'][i])
  print(LEM)
  lem.append(LEM)

df2['Lemma'] = lem

In [8]:
df2

Unnamed: 0,Year,Category,QN,Passage,Lemma
0,2015,Context,Q18,One difference between winners and losers is h...,"[one, difference, between, winner, and, loser,..."
1,2015,Context,Q19,"As I walked to the train station, I felt the w...","[as, i, walk, to, the, train, station, i, feel..."
2,2015,Context,Q20,Many disciplines are better learned by enterin...,"[many, discipline, be, better, learn, by, ente..."
3,2015,Context,Q21,The most normal and competent child encounters...,"[the, most, normal, and, competent, child, enc..."
4,2015,Context,Q22,The most normal and competent child encounters...,"[the, most, normal, and, competent, child, enc..."
...,...,...,...,...,...
131,2022,Infer-Logic,Q36,"According to the market response model, it is ...","[accord, to, the, market, response, model, it,..."
132,2022,Infer-Logic,Q37,In spite of the likeness between the fictional...,"[in, spite, of, the, likeness, between, the, f..."
133,2022,Infer-Logic,Q38,Retraining current employees for new positions...,"[retraining, current, employee, for, new, posi..."
134,2022,Infer-Logic,Q39,As long as the irrealism of the silent black a...,"[as, long, as, the, irrealism, of, the, silent..."


In [12]:
# ADD LD indices

#1. Create empty lists.
TTR = []
RTTR = []
LogTTR = []
MassTTR = []
MSTTR = []
MATTR = []
HDD = []
MTLD = []
MTLD_wrap = []
MTLD_bid = []

# 2. Getting LD index values for each cell:

for i in range(0, len(df2['Lemma'])):
  flt = df2['Lemma'][i]
  ttr = ld.ttr(flt)
  rttr = ld.root_ttr(flt)
  logttr = ld.log_ttr(flt)
  mass = ld.maas_ttr(flt)
  fdttr = ld.msttr(flt)
  mattr = ld.mattr(flt)
  hdd = ld.hdd(flt)
  mtld = ld.mtld(flt)
  mtld_wrap = ld.mtld_ma_wrap(flt)
  mtld_bid = ld.mtld_ma_bid(flt)

  # Add values to each list
  TTR.append(ttr)
  RTTR.append(rttr)
  LogTTR.append(logttr)
  MassTTR.append(mass)
  MSTTR.append(fdttr)
  MATTR.append(mattr)
  HDD.append(hdd)
  MTLD.append(mtld)
  MTLD_wrap.append(mtld_wrap)
  MTLD_bid.append(mtld_bid)

# Add columns
df2['TTR'] = TTR
df2['RTTR'] = RTTR
df2['LogTTR'] = LogTTR
df2['MassTTR'] = MassTTR
df2['MSTTR'] = MSTTR
df2['MATTR'] = MATTR
df2['HDD'] = HDD
df2['MTLD'] = MTLD
df2['MTLD_wrap'] = MTLD_wrap
df2['MTLD_bid'] = MTLD_bid
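
To make the two windowed measures computed above more concrete, here is a minimal sketch of the ideas behind MSTTR and MATTR in plain Python. The function names, the toy token list, and the window size of 4 are hypothetical choices for illustration, and texts shorter than the window are not handled; `ld.msttr` and `ld.mattr` may treat edge cases differently.

```python
def ttr(tokens):
    """Type-token ratio of a token list."""
    return len(set(tokens)) / len(tokens)

def msttr_sketch(tokens, window):
    """Mean TTR over consecutive non-overlapping segments of `window` tokens.
    A trailing segment shorter than `window` is dropped (assumption of this sketch)."""
    segments = [tokens[i:i + window]
                for i in range(0, len(tokens) - window + 1, window)]
    return sum(ttr(seg) for seg in segments) / len(segments)

def mattr_sketch(tokens, window):
    """Mean TTR over every window of `window` tokens, sliding one token at a time."""
    windows = [tokens[i:i + window] for i in range(len(tokens) - window + 1)]
    return sum(ttr(win) for win in windows) / len(windows)

toy = "a b a c a b d e".split()
print(msttr_sketch(toy, 4))  # 0.875
print(mattr_sketch(toy, 4))  # 0.85
```

Because both measures always evaluate TTR on segments of the same size, they are less sensitive to overall passage length than plain TTR.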

## Result file

In [20]:
df2

Unnamed: 0,Year,Category,QN,Passage,Lemma,TTR,RTTR,LogTTR,MassTTR,MSTTR,MATTR,HDD,MTLD,MTLD_wrap,MTLD_bid
0,2015,Context,Q18,One difference between winners and losers is h...,"[one, difference, between, winner, and, loser,...",0.728972,7.540545,0.932349,0.033336,0.790000,0.794828,0.841506,110.542069,107.811321,0.000000
1,2015,Context,Q19,"As I walked to the train station, I felt the w...","[as, i, walk, to, the, train, station, i, feel...",0.666667,7.393691,0.915742,0.040317,0.760000,0.795405,0.824749,62.363391,65.430894,63.468687
2,2015,Context,Q20,Many disciplines are better learned by enterin...,"[many, discipline, be, better, learn, by, ente...",0.623932,6.748852,0.900946,0.047894,0.730000,0.758529,0.776571,53.164604,57.623932,58.574561
3,2015,Context,Q21,The most normal and competent child encounters...,"[the, most, normal, and, competent, child, enc...",0.703125,7.954951,0.927408,0.034450,0.840000,0.813165,0.848132,112.313725,106.812500,93.432773
4,2015,Context,Q22,The most normal and competent child encounters...,"[the, most, normal, and, competent, child, enc...",0.703125,7.954951,0.927408,0.034450,0.840000,0.813165,0.848132,112.313725,106.812500,93.432773
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131,2022,Infer-Logic,Q36,"According to the market response model, it is ...","[accord, to, the, market, response, model, it,...",0.570552,7.284322,0.889835,0.049799,0.813333,0.810877,0.803874,87.560227,79.687117,78.402729
132,2022,Infer-Logic,Q37,In spite of the likeness between the fictional...,"[in, spite, of, the, likeness, between, the, f...",0.532934,6.887027,0.877030,0.055324,0.673333,0.690678,0.746895,35.844810,36.263473,35.668548
133,2022,Infer-Logic,Q38,Retraining current employees for new positions...,"[retraining, current, employee, for, new, posi...",0.574194,7.148652,0.889998,0.050222,0.793333,0.783962,0.804126,82.405437,73.432258,64.910794
134,2022,Infer-Logic,Q39,As long as the irrealism of the silent black a...,"[as, long, as, the, irrealism, of, the, silent...",0.574713,7.580980,0.892638,0.047918,0.773333,0.792000,0.797267,71.008388,69.804598,68.444683
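
The MTLD column above can be hard to interpret without seeing the procedure, so the following is a minimal sketch of the basic idea: walk through the text, count a "factor" each time the running TTR falls below a threshold, and divide the number of tokens by the number of factors. The function names and toy tokens are hypothetical, the 0.72 threshold is assumed here as the conventional value, and details such as the handling of ties or very short texts may differ from `ld.mtld`.

```python
def mtld_pass(tokens, threshold=0.72):
    """One directional pass: count how often the running TTR drops below
    `threshold`, then divide the token count by that factor count."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count < threshold:
            factors += 1              # a full factor is complete
            types, count = set(), 0   # restart the running TTR
    if count > 0:
        # Partial factor for leftover tokens, scaled by how far the
        # running TTR has moved from 1 toward the threshold.
        factors += (1 - len(types) / count) / (1 - threshold)
    # If the TTR never left 1.0, there is no factor to divide by.
    return len(tokens) / factors if factors > 0 else float("inf")

def mtld_sketch(tokens, threshold=0.72):
    """Average of a forward pass and a pass over the reversed text."""
    forward = mtld_pass(tokens, threshold)
    backward = mtld_pass(tokens[::-1], threshold)
    return (forward + backward) / 2

toy = "a b a c d a a".split()
print(mtld_pass(toy))
print(mtld_sketch(toy))
```

Higher values mean the text sustains a high TTR over longer stretches before repetition pulls it below the threshold.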


In [21]:
df2.describe()

Unnamed: 0,Year,TTR,RTTR,LogTTR,MassTTR,MSTTR,MATTR,HDD,MTLD,MTLD_wrap,MTLD_bid
count,136.0,136.0,136.0,136.0,136.0,136.0,136.0,136.0,136.0,136.0,136.0
mean,2018.5,0.606593,7.44432,0.899571,0.046065,0.77201,0.770921,0.802974,70.714724,71.265205,63.743348
std,2.299758,0.05768,0.675416,0.018354,0.008092,0.049462,0.045547,0.039108,20.633628,20.801995,19.685693
min,2015.0,0.451282,5.376082,0.843935,0.031885,0.63,0.581905,0.6865,33.165609,32.157895,0.0
25%,2016.75,0.572516,7.022525,0.888772,0.03984,0.74,0.749653,0.780868,55.157307,57.424318,51.836679
50%,2018.5,0.603484,7.441873,0.899538,0.046061,0.776667,0.775913,0.806953,68.431332,68.806323,62.218431
75%,2020.25,0.648782,7.954951,0.91398,0.050385,0.806667,0.802464,0.833115,83.649431,83.517426,75.401795
max,2022.0,0.728972,9.071147,0.932349,0.073482,0.89,0.866111,0.880651,138.947368,132.21875,125.784722
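
The HDD values summarized above lie between 0 and 1 because the index can be read as an expected TTR for a random sample of fixed size. The sketch below shows that reading with Python's `math.comb`; the function name, the toy tokens, and the sample size of 2 are assumptions for illustration (published descriptions of HD-D typically use a sample of 42 tokens), and `ld.hdd` may differ in implementation details.

```python
from collections import Counter
from math import comb

def hdd_sketch(tokens, sample_size):
    """Sum, over types, of the chance that the type appears at least once in a
    random sample of `sample_size` tokens drawn without replacement, divided
    by `sample_size` so the result reads like an expected TTR."""
    n_tokens = len(tokens)
    total = 0.0
    for freq in Counter(tokens).values():
        # Hypergeometric probability of drawing zero copies of this type.
        p_absent = comb(n_tokens - freq, sample_size) / comb(n_tokens, sample_size)
        total += (1 - p_absent) / sample_size
    return total

toy = "a a b c".split()
print(hdd_sketch(toy, 2))  # 11/12, about 0.917
print(hdd_sketch(toy, 4))  # 0.75, the plain TTR when the sample is the whole text
```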


# Plotting

In [15]:
from matplotlib import pyplot as plt

In [16]:
ordered_ttr = sorted(df2['TTR'])  # list.sort() sorts in place and returns None
ordered_ttr

In [22]:
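# N_Splits comes from the word-splitting cell above; recreate it if that cell was skipped.
if 'N_Splits' not in df2.columns:
    df2['N_Splits'] = df2['Passage'].str.split().str.len()
a1 = df2[['N_Splits','TTR','MSTTR']]
a2 = a1.sort_values(by=['N_Splits'])
a2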

In [23]:
df3 = df2.sort_values(by=['N_Splits'])


In [24]:
f = plt.figure(figsize=(10, 10))
plt.scatter(df2['N_Splits'],df2['TTR'],  label='TTR')
# plt.scatter(df2['N_Splits'],df2['LogTTR'],  label='LogTTR')
# plt.scatter(df2['N_Splits'],df2['MSTTR'],  label='MSTTR')
plt.legend()


### Linear regression of TTR and Length
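
The reason for regressing TTR on passage length is that TTR tends to fall as a text gets longer, since later words are increasingly likely to repeat earlier ones. A minimal sketch of that effect, using a made-up text that cycles through a 100-word vocabulary (the vocabulary size and text length are arbitrary assumptions):

```python
# Hypothetical text: 500 tokens cycling through a 100-word vocabulary.
tokens = [f"w{i % 100}" for i in range(500)]

def ttr(tokens):
    """Type-token ratio of a token list."""
    return len(set(tokens)) / len(tokens)

# TTR of progressively longer prefixes of the same text.
for n in (50, 100, 200, 500):
    print(n, ttr(tokens[:n]))
# 50 1.0
# 100 1.0
# 200 0.5
# 500 0.2
```

Real passages repeat words far less regularly, but the same downward pull applies, which is why a negative slope for TTR against `N_Splits` would not be surprising.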

In [25]:
# importing the required library
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [26]:
# # f = plt.figure(figsize=(10, 10))
# fig, (ax1, ax2) = plt.subplots(1, 2)
# fig.suptitle('Horizontally stacked subplots')
# data
df = df2

# scatter plot with regression
# line(by default)
sns.lmplot(x ='N_Splits', y ='TTR', data = df)

# Show the plot
plt.show()


In [27]:
# Residual plots
sns.residplot(x ='N_Splits', y ='TTR', data = df)


In [28]:
sns.lmplot(x ='N_Splits', y ='MTLD', data = df)

# Show the plot
plt.show()


In [29]:
# Residual plots
sns.residplot(x ='N_Splits', y ='MTLD', data = df)


In [30]:
import statsmodels.api as sm

X = sm.add_constant(df["N_Splits"])  # sm.OLS fits no intercept unless a constant column is added
y = df["TTR"]

# Note the difference in argument order
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make the predictions by the model

# Print out the statistics
model.summary()


In [None]:
X = sm.add_constant(df["N_Splits"])  # sm.OLS fits no intercept unless a constant column is added
y = df["MTLD"]

# Note the difference in argument order
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make the predictions by the model

# Print out the statistics
model.summary()
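
`sm.OLS` estimates coefficients only for the columns present in `X`, so whether a constant column is included decides whether the fitted line has an intercept. The sketch below, in plain Python with made-up data and hypothetical function names, shows how much the slope can change between the two fits:

```python
def fit_through_origin(x, y):
    """Least-squares slope for y = b*x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def fit_with_intercept(x, y):
    """Least-squares (intercept, slope) for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]                   # generated from y = 1 + 2x
print(fit_through_origin(x, y))    # about 2.33: the slope absorbs the missing intercept
print(fit_with_intercept(x, y))    # (1.0, 2.0)
```

For indices such as TTR, which do not approach zero as passage length approaches zero, a fit without an intercept can give a misleading slope.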

## Boxplots

In [None]:
import seaborn as sns
import numpy as np

### Setting the figure size

In [None]:
sns.set(rc={'figure.figsize':(12,8)}) #set width and height

In [None]:
df = df2
df = df[['Year','TTR','MSTTR','HDD','MTLD']]

dd = pd.melt(df, id_vars=['Year'], value_vars=['TTR','HDD','MSTTR'], var_name='LD')
sns.boxplot(x='Year', y='value', data=dd, hue='LD')

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# --- Your data, e.g. results per algorithm:
data1 = [5,5,4,3,3,5]
data2 = [6,6,4,6,8,5]
data3 = [7,8,4,5,8,2]
data4 = [6,9,3,6,8,4]

# --- Combining your data:
data_group1 = [data1, data2]
data_group2 = [data3, data4]

# --- Labels for your data:
labels_list = ['a','b']
xlocations  = range(len(data_group1))
width       = 0.3
symbol      = 'r+'
ymin        = 0
ymax        = 10

ax = plt.gca()
ax.set_ylim(ymin,ymax)
ax.set_xticks(xlocations)   # fix the tick positions before assigning their labels
ax.set_xticklabels( labels_list, rotation=0 )
ax.grid(True, linestyle='dotted')
ax.set_axisbelow(True)
plt.xlabel('X axis label')
plt.ylabel('Y axis label')
plt.title('title')

# --- Offset the positions per group:
positions_group1 = [x-(width+0.01) for x in xlocations]
positions_group2 = xlocations

plt.boxplot(data_group1,
            sym=symbol,
            labels=['']*len(labels_list),
            positions=positions_group1,
            widths=width,
#           notch=False,
#           vert=True,
#           whis=1.5,
#           bootstrap=None,
#           usermedians=None,
#           conf_intervals=None,
#           patch_artist=False,
            )

plt.boxplot(data_group2,
            labels=labels_list,
            sym=symbol,
            positions=positions_group2,
            widths=width,
#           notch=False,
#           vert=True,
#           whis=1.5,
#           bootstrap=None,
#           usermedians=None,
#           conf_intervals=None,
#           patch_artist=False,
            )

plt.savefig('boxplot_grouped.png')
plt.savefig('boxplot_grouped.pdf')    # when publishing, use high quality PDFs
#plt.show()                   # uncomment to show the plot.

In [None]:
dd = pd.melt(df3, id_vars=['Year'], value_vars=['TTR','HDD'], var_name='LD')
sns.boxplot(x = 'Year', y = 'value', data=dd, hue='LD')


### MTLD grouped plots

In [None]:
datatop = df2.tail()
datatop

Column names:

In [None]:
for col in df2.columns:
    print(col)

In [None]:
groups = df2.groupby("Category")
for name, group in groups:
    plt.plot(group["N_Splits"], group["MTLD"], marker="o", linestyle="", label=name)
plt.legend()

In [None]:
f = plt.figure(figsize=(10, 10))
flights = sns.load_dataset("flights")
flights = flights.pivot(index="month", columns="year", values="passengers")  # keyword arguments are required in pandas 2.x
ax = sns.heatmap(flights)

data = df2
data = data.pivot(index="Year", columns="QN", values="MTLD")  # keyword arguments are required in pandas 2.x
ax = sns.heatmap(data)

In [None]:
import pandas as pd
import numpy as np

rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(10, 10))
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')

In [None]:
df3

In [None]:
f = plt.figure(figsize=(10, 10))
plt.matshow(df.corr(), fignum=f.number)
plt.xticks(range(df.select_dtypes(['number']).shape[1]), df.select_dtypes(['number']).columns, fontsize=14, rotation=45)
plt.yticks(range(df.select_dtypes(['number']).shape[1]), df.select_dtypes(['number']).columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16);
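
`DataFrame.corr()` computes pairwise Pearson correlations by default, and each cell of the matrix plotted above is one such coefficient. A minimal sketch of that computation in plain Python follows; the function name and the index values are made up for illustration and are not taken from the data in this notebook.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical index values for four passages.
columns = {
    "TTR":  [0.50, 0.60, 0.70, 0.80],
    "HDD":  [0.70, 0.75, 0.80, 0.85],
    "Maas": [0.06, 0.05, 0.04, 0.03],
}
names = list(columns)
# Square matrix: one coefficient per pair of columns.
matrix = [[pearson(columns[r], columns[c]) for c in names] for r in names]
for name, row in zip(names, matrix):
    print(name, [round(v, 2) for v in row])
```

In this toy data the Maas column correlates negatively with the other two, which is the pattern to expect from an index where lower values mean higher diversity.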

In [None]:
df

In [None]:
# df4 = df3.iloc[:, [6,8,9,10,11,12,13,14,15,16,17]]

In [None]:
# import seaborn as sns; sns.set_theme()