# Introductory Tutorial

<a target="_blank" href="https://colab.research.google.com/github/HLasse/TextDescriptives/blob/main/docs/tutorials/introductory_tutorial.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

TextDescriptives lets you quickly and easily calculate a large variety of statistics and metrics from texts. 

The package is organised into components, so you only spend time extracting the metrics you care about.

This tutorial introduces some of the components available in TextDescriptives and shows how you can use them to quickly analyse a text corpus.
For more information on the components, see the [documentation](https://hlasse.github.io/TextDescriptives/).

## Exploratory Data Analysis
In this tutorial we'll use TextDescriptives to get a quick overview of the [SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).
The dataset contains 5572 SMS messages categorized as ham or spam. 

To start, let's load a `spacy` pipeline and add some components to it. 


In [None]:
try:
    import textdescriptives
except ImportError:
    # TextDescriptives is not installed yet: install it with the tutorials extra
    !pip install "textdescriptives[tutorials]"

# download spaCy model
!python -m spacy download en_core_web_sm


In [None]:
import spacy
import textdescriptives as td

nlp = spacy.load("en_core_web_sm")

nlp.add_pipe("textdescriptives/readability")
nlp.add_pipe("textdescriptives/dependency_distance")

From now on, whenever we pass a document through the spaCy pipeline (`nlp`), TextDescriptives will add readability and dependency distance metrics to the document.

Let's load the data and pass it through the pipeline.

In [None]:
from textdescriptives.utils import load_sms_data

df = load_sms_data()
df.head()

In [None]:
df["label"].value_counts()

In [None]:
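# nlp.pipe returns a lazy generator of processed documents
docs = nlp.pipe(df["message"])

In [None]:
# collect the metrics of all documents in a dataframe, leaving out the raw text
metrics = td.extract_df(docs, include_text=False)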

In [None]:
# join the metrics to the original dataframe
df = df.join(metrics, how="left")

That's all we need to do! Let's take a look at the dataframe and the metrics we have calculated.

In [None]:
df.head()

Let's do some quick exploratory data analysis to get a sense of the data.

In [None]:
import seaborn as sns
sns.boxplot(x="label", y="lix", data=df)
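
To make the `lix` axis easier to read, here is a rough, by-hand sketch of the LIX readability formula: the average number of words per sentence plus the percentage of words longer than six letters. The helper name `approximate_lix`, the regex-based tokenisation, and the two example messages are assumptions made for illustration. TextDescriptives works from spaCy's tokens and sentence boundaries, so its values will generally differ from this approximation.

```python
import re


def approximate_lix(text: str) -> float:
    """Rough LIX: words per sentence + percentage of words longer than 6 letters."""
    # Crude sentence split on ., ! and ? (an assumption, not spaCy's segmentation)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters only; digits and punctuation are ignored
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        return 0.0
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


# Made-up example messages, not taken from the dataset
ham_example = "Ok see you at home. Call me later."
spam_example = "Congratulations! You have been selected to receive exclusive vouchers."

print(f"ham:  {approximate_lix(ham_example):.1f}")
print(f"spam: {approximate_lix(spam_example):.1f}")
```

Short, simple words and short sentences give a low score, while long words push the score up quickly.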

Let's run a quick test to see if any of our metrics correlate strongly with the label.

In [None]:
# encode the label as a boolean
df["is_ham"] = df["label"] == "ham"
# compute the correlation between all metrics and the label
metrics_correlations = metrics.corrwith(df["is_ham"]).sort_values(key=abs, ascending=False)
metrics_correlations[:10]
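
As a small aside on what these numbers mean: `corrwith` uses Pearson correlation by default, and with a boolean label this is the point-biserial correlation, i.e. Pearson's r with the label coded as 0 and 1. The sketch below computes it by hand on made-up values. The `pearson` helper and the toy character counts are assumptions for illustration and are not part of TextDescriptives or the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up toy data: three ham messages and two spam messages
is_ham = [True, True, True, False, False]
n_characters = [20, 35, 28, 140, 155]

# Code the boolean label as 1.0 (ham) and 0.0 (spam) before correlating
r = pearson([float(label) for label in is_ham], n_characters)
print(f"correlation between is_ham and n_characters: {r:.3f}")
```

In the ranking above, a negative value therefore means that the metric tends to be higher for spam than for ham, and a positive value means the opposite.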

Those are some pretty high correlations. Let's plot a few of them!

In [None]:
# plot kernel density estimates for the three most strongly correlated metrics
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 3, figsize=(10, 5), sharey=False)
for i, metric in enumerate(metrics_correlations.index[:3]):
    sns.kdeplot(data=df, x=metric, hue="label", ax=ax[i])

Cool! We've now done a quick analysis of the SMS dataset and found that metrics such as the standard deviation of token length, the number of characters, and the number of unique tokens are distributed differently between ham and spam messages.

Next steps could be to continue the exploratory data analysis or to build a simple classifier using the extracted metrics.