# Acquire:

**Objective/Goal:** 

Our goal is to design and test a machine learning model that takes an essay as input and predicts whether it was written by an AI or by a human.

**Details:**
# File and Field Information

## `{test|train}_essays.csv`

- `id` - A unique identifier for each essay.
- `prompt_id` - Identifies the prompt the essay was written in response to.
- `text` - The essay text itself.
- `generated` - Whether the essay was written by a student (`0`) or generated by an LLM (`1`). This field is the target and is not present in `test_essays.csv`.

## `train_prompts.csv` - Essays were written in response to information in these fields.

- `prompt_id` - A unique identifier for each prompt.
- `prompt_name` - The title of the prompt.
- `instructions` - The instructions given to students.
- `source_text` - The text of the article(s) the essays were written in response to, in Markdown format. Significant paragraphs are enumerated by a numeral preceding the paragraph on the same line, as in `0 Paragraph one.\n\n1 Paragraph two.` Essays sometimes refer to a paragraph by its numeral. Each article is preceded with its title in a heading, like `# Title`. When an author is indicated, their name will be given in the title after `by`. Not all articles have authors indicated. An article may have subheadings indicated like `## Subheading`.
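
The `source_text` layout described above can be unpacked with a few lines of Python. This is only a minimal sketch: it assumes blocks are separated by blank lines, the function name `parse_source_text` is made up for illustration, and the sample string is invented rather than taken from `train_prompts.csv`.

```python
import re

# Matches a paragraph that begins with its numeral, e.g. "0 Paragraph one."
NUMBERED = re.compile(r"^(\d+)\s+(.*)$", re.DOTALL)


def parse_source_text(source_text):
    """Split a source_text value into headings and numbered paragraphs.

    Hypothetical helper: assumes blocks are separated by blank lines.
    Paragraphs without a leading numeral are skipped.
    """
    headings, paragraphs = [], {}
    for block in source_text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            # "# Title" has level 1, "## Subheading" has level 2.
            level = len(block) - len(block.lstrip("#"))
            headings.append((level, block.lstrip("#").strip()))
            continue
        match = NUMBERED.match(block)
        if match:
            paragraphs[int(match.group(1))] = match.group(2)
    return headings, paragraphs


# Invented sample in the documented format, not real competition data.
sample = "# Example Title by A. Author\n\n0 Paragraph one.\n\n1 Paragraph two."
headings, paragraphs = parse_source_text(sample)
```

Keying the paragraphs by their numeral makes it easy to look up a paragraph when an essay refers to it by number.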

## `sample_submission.csv` - A submission file in the correct format. See the Evaluation page for details.

We are importing the data that was provided for the competition. Below is a reminder of the rules as they pertain to outside data.



## Wrangle Imports:

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

In [2]:
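# Note: this reads train_essays.csv, the same file as `responses` below;
# the prompt metadata described above is stored in train_prompts.csv.
prompts = pd.read_csv("data/llm-detect-ai-generated-text/train_essays.csv")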

In [3]:
len(prompts)

1378

In [4]:
responses = pd.read_csv("data/llm-detect-ai-generated-text/train_essays.csv")

In [5]:
len(responses)

1378

# Preparing the text

## Responses: First Glance

In [6]:
prompts.head()

Unnamed: 0,id,prompt_id,text,generated
0,0059830c,0,Cars. Cars have been around since they became ...,0
1,005db917,0,Transportation is a large necessity in most co...,0
2,008f63e3,0,"""America's love affair with it's vehicles seem...",0
3,00940276,0,How often do you ride in a car? Do you drive a...,0
4,00c39458,0,Cars are a wonderful thing. They are perhaps o...,0


### Responses notes: 

* I need to take a closer look at the text column to find the best way to normalize it for processing. One concern is that altering the form of the essay text might destroy crucial insights into the grammatical styles of humans and AI. We can explore this at a later time.
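
One possible way to act on that concern is to record a few surface statistics from the raw text before any lowercasing or punctuation stripping. The sketch below is an assumption on my part, not something the notebook does yet: the function `surface_style_features` and the features it computes are hypothetical, and the sample sentence is made up.

```python
import string


def surface_style_features(raw_text):
    """Return a few casing and punctuation statistics for one raw essay.

    Hypothetical helper: the feature set is illustrative only.
    """
    letters = [c for c in raw_text if c.isalpha()]
    words = raw_text.split()
    n_punct = sum(c in string.punctuation for c in raw_text)
    return {
        "n_words": len(words),
        # Share of letters that are uppercase (lost after .str.lower()).
        "uppercase_ratio": (
            sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
        ),
        # ASCII punctuation marks per whitespace-separated word.
        "punctuation_per_word": n_punct / len(words) if words else 0.0,
        "comma_count": raw_text.count(","),
    }


# Made-up sentence, used only to exercise the function.
features = surface_style_features("Cars. Cars have been around, haven't they?")
```

In the notebook this could be applied with `responses.text.apply(surface_style_features)` before cell 8 runs, so the statistics are computed on the unaltered essays.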

In [7]:
responses.head(3)

Unnamed: 0,id,prompt_id,text,generated
0,0059830c,0,Cars. Cars have been around since they became ...,0
1,005db917,0,Transportation is a large necessity in most co...,0
2,008f63e3,0,"""America's love affair with it's vehicles seem...",0


In [8]:
# First we are going to standardize the text,
# starting by lowercasing everything.
responses.text = responses.text.str.lower()

In [9]:
# Next we normalize Unicode characters (NFKD) and drop anything that has no ASCII equivalent.

responses['text'] = responses['text'].apply(lambda x: unicodedata.normalize('NFKD',x)
                                           .encode('ascii', 'ignore')
                                           .decode('utf-8', 'ignore'))

responses.text

0       cars. cars have been around since they became ...
1       transportation is a large necessity in most co...
2       "america's love affair with it's vehicles seem...
3       how often do you ride in a car? do you drive a...
4       cars are a wonderful thing. they are perhaps o...
                              ...                        
1373    there has been a fuss about the elector colleg...
1374    limiting car usage has many advantages. such a...
1375    there's a new trend that has been developing f...
1376    as we all know cars are a big part of our soci...
1377    cars have been around since the 1800's and hav...
Name: text, Length: 1378, dtype: object

In [10]:
# Keep only lowercase letters, digits, apostrophes, and whitespace.
responses['text'] = responses['text'].str.replace(r"[^a-z0-9'\s]", '', regex=True)
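
To see what the two cleaning steps above do to a single string, here is a small standalone sketch using the same normalization call and the same regular expression. The helper names and the `replacement` parameter are hypothetical additions, and the sample string is invented. It illustrates one side effect: deleting punctuation outright fuses hyphenated words, whereas substituting a space keeps them apart.

```python
import re
import unicodedata


def to_ascii(text):
    """Decompose characters (NFKD) and drop whatever is not ASCII."""
    return (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("utf-8", "ignore")
    )


def strip_punctuation(text, replacement=""):
    """Replace everything except a-z, 0-9, apostrophes, and whitespace.

    `replacement` is a hypothetical knob; the notebook uses ''.
    """
    return re.sub(r"[^a-z0-9'\s]", replacement, text)


# Invented sample with an accent, a curly apostrophe, and hyphens.
sample = "the café’s car-intensive model-t."
ascii_only = to_ascii(sample)                  # accent folded, curly quote dropped
deleted = strip_punctuation(ascii_only)        # notebook behaviour
spaced = strip_punctuation(ascii_only, " ")    # alternative: keep word boundaries
```

With deletion, `car-intensive` collapses into one token, much like the `carintensive` and `modelt` tokens visible in the tokenized essay. Whether the spaced variant is preferable depends on how the tokens are used later.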

## Tokenizing the text 

In [11]:
tokenizer = nltk.tokenize.ToktokTokenizer()

responses.text = responses.text.apply(lambda x: tokenizer.tokenize(x, return_str=True))

In [12]:
responses.text[0]

"cars cars have been around since they became famous in the 1900s when henry ford created and built the first modelt cars have played a major role in our every day lives since then but now people are starting to question if limiting car usage would be a good thing to me limiting the use of cars might be a good thing to do\n\nin like matter of this article in german suburb life goes on without cars by elizabeth rosenthal states how automobiles are the linchpin of suburbs where middle class families from either shanghai or chicago tend to make their homes experts say how this is a huge impediment to current efforts to reduce greenhouse gas emissions from tailpipe passenger cars are responsible for 12 percent of greenhouse gas emissions in europeand up to 50 percent in some carintensive areas in the united states cars are the main reason for the greenhouse gas emissions because of a lot of people driving them around all the time getting where they need to go article paris bans driving due

## Stemming the Words

In [14]:
stems = []
ps = nltk.porter.PorterStemmer()
text_response_list = responses.text.to_list()
for response in text_response_list:
    for word in response.split():
        stemmed_word = ps.stem(word)
        stems.append(stemmed_word)

In [19]:
pd.Series(stems).value_counts().head(10)

the     48931
to      23922
of      22210
a       18912
and     17285
in      16430
is      12546
car     11593
that    11352
vote    10553
dtype: int64

## Removing Stop Words 

In [20]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/fermingarcia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [21]:
stopword_list = stopwords.words('english')
stopword_list[:10]


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [23]:
filtered_words = [w for w in stems if w not in stopword_list]

In [24]:
pd.Series(filtered_words).value_counts().head(10)

car        11593
vote       10553
elector     9954
'           7120
peopl       7043
state       6291
colleg      6124
thi         5894
presid      3993
would       3790
dtype: int64
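
The counts above still contain `thi`, most likely because "this" was stemmed before the stop-word filter ran and the stem no longer matches the list. A minimal sketch of the alternative order (filter first, then stem) follows. The names `filter_then_stem` and `toy_stem` are hypothetical; `toy_stem` is only a stand-in so the example runs without NLTK, and the token list is invented.

```python
from collections import Counter


def filter_then_stem(tokens, stem, stop_words):
    """Drop stop words first, then stem what remains."""
    stop_words = set(stop_words)
    return [stem(tok) for tok in tokens if tok not in stop_words]


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


# Invented tokens and a tiny stop-word list, for illustration only.
tokens = "this car has votes and this state has cars".split()
stop_words = ["this", "has", "and"]

# Order used in the notebook: stem, then filter ("thi" slips through).
stem_first = [s for s in map(toy_stem, tokens) if s not in stop_words]

# Alternative order: filter, then stem.
filter_first = filter_then_stem(tokens, toy_stem, stop_words)
counts = Counter(filter_first)
```

In the notebook, `ps.stem` and `stopwords.words('english')` could be passed in place of the toy stemmer and the short list.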

## Let's make a function that takes in a pandas Series and does what we have done so far.


In [31]:
def stemmer_function(series):
    # First we lowercase the text and strip everything except
    # letters, digits, apostrophes, and whitespace.
    series = series.str.lower().str.replace(r"[^a-z0-9'\s]", '', regex=True)
    
    # Then we tokenize the text.
    tokenizer = nltk.tokenize.ToktokTokenizer()
    series = series.apply(lambda x: tokenizer.tokenize(x, return_str=True))

    # Stem every word of the series passed in (not the global `responses`).
    stems = []
    ps = nltk.porter.PorterStemmer()
    for response in series.to_list():
        for word in response.split():
            stems.append(ps.stem(word))

    # Finally, drop the English stop words.
    stopword_list = stopwords.words('english')
    filtered_list = [w for w in stems if w not in stopword_list]
    
    return filtered_list

## Splitting the text from students and AI.

In [27]:
text_with_target_variable = responses[['text', 'generated']]

In [28]:
ai_generated_essays = responses.loc[text_with_target_variable.generated == 1]

In [29]:
ai_generated_essays 

Unnamed: 0,id,prompt_id,text,generated
704,82131f68,1,this essay will analyze discuss and prove one ...,1
740,86fe4f18,1,i strongly believe that the electoral college ...,1
1262,eafb8a56,0,limiting car use causes pollution increases co...,1


In [33]:
import nlp_functions

nlp_functions.stemmer_function(ai_generated_essays.text)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/fermingarcia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['thi',
 'essay',
 'analyz',
 'discuss',
 'prove',
 'one',
 'reason',
 'favor',
 'keep',
 'elector',
 'colleg',
 'unit',
 'state',
 'presidenti',
 'elect',
 'one',
 'reason',
 'keep',
 'elector',
 'colleg',
 'better',
 'smaller',
 'rural',
 'state',
 'influenc',
 'oppos',
 'larger',
 'metropolitan',
 'area',
 'larg',
 'popul',
 'elector',
 'state',
 'grant',
 'two',
 'vote',
 'larger',
 'popul',
 'area',
 'grant',
 'one',
 'vote',
 'smaller',
 'state',
 'tend',
 'hold',
 'signific',
 'power',
 'becaus',
 'two',
 'vote',
 'presid',
 'vice',
 'presid',
 'add',
 'vote',
 'larger',
 'state',
 'mani',
 'elector',
 'thi',
 'becaus',
 'split',
 'elector',
 'vote',
 'argu',
 'elector',
 'bound',
 'vote',
 'candid',
 'vote',
 'nation',
 'vote',
 'state',
 "'",
 'nomine',
 'unless',
 'state',
 'ha',
 'winner',
 'take',
 'system',
 'howev',
 'state',
 'adopt',
 'law',
 'forc',
 'elector',
 'vote',
 'state',
 "'",
 'candid',
 'seem',
 'matter',
 'elector',
 'bound',
 'vote',
 'candid',
 'nation',
