<a href="https://colab.research.google.com/github/PurpleDin0/QDA_NLG_Detection/blob/master/Data_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Collect and prep data for analysis

Contact info: Barney.Ales@gmail.com



# Introduction
This is the start of a multi-step process to analyze the capabilities of Natural Language Generation (NLG) models.  Work was split into three Python notebooks in order to give credit to code written by others, simplify future analysis, and maximize code reuse by others.  All notebooks are publicly available on [my GitHub](https://github.com/PurpleDin0/QDA_NLG_Detection), along with the training data I used.  <font color=yellow> The trained model is ~500 MB in size.  This is too large a file for GitHub, so it is located in Google Drive here **(insert link to final google drive resting location)**.</font>

1.   [Data_Cleaning.ipynb](https://github.com/PurpleDin0/QDA_NLG_Detection/blob/master/Data_Cleaning.ipynb) (This notebook)
  0.   Explain the project
  1.   Import data
  2.   Perform Basic data cleaning
  3.   Train the Markovify model
  4.   Generate text with the trained Markovify Model
2.  ["Train a GPT-2 Text-Generating Model w/ GPU.ipynb"](https://github.com/PurpleDin0/QDA_NLG_Detection/blob/master/Train%20a%20GPT-2%20Text-Generating%20Model%20w_%20GPU.ipynb)
  1. Fine-tune 124M-parameter version of GPT-2
  2. Save fine-tuned model
  3. Generate text with the trained model 
3. [Thesis_Analysis.ipynb](https://github.com/PurpleDin0/QDA_NLG_Detection/blob/master/Analysis.ipynb)
  1. Load generated data
  2. Analyze generated data
  3. Graph results

## Research Question & Hypothesis

**Research Question**  
Can machine-generated text be detected through quantitative evaluation of centering resonance analysis (CRA) networks?  
**Hypothesis**  
Machine-generated text created using Natural Language Generation (NLG) systems will be more discursively similar to other samples of machine-generated text by a statistically significant degree than to comparable human-created text content.

# First load all the relevant libraries
---
* [pandas](https://pandas.pydata.org/): used to read in CSV data, do basic data cleaning, and store all our data.  This is a heavy hitter of Python for Data Science.
* [Markovify](https://datascienceplus.com/natural-language-generation-with-markovify-in-python/): used to build the Markov chain generator; the link points to a tutorial.  Here is the [GitHub link](https://github.com/jsvine/markovify)
* [Pickle](https://docs.python.org/3/library/pickle.html): Used to save variables for later use
* [JSON](https://docs.python.org/3/library/json.html): Used to save data to files for later use

In [0]:
#markovify needs to be installed as it isn't a baseline python module
!pip install markovify 

#clone the GitHub repo with all the training data.
!git clone https://github.com/PurpleDin0/QDA_NLG_Detection.git 

# navigate to the created folder
%cd /content/QDA_NLG_Detection/

import pandas as pd #Pandas, so we can do lots of cool data science stuff
print("Pandas imported as Version: ",pd.__version__)
import markovify #Markov Chain Generator, train and generate an NLG model
print("Markovify imported as Version: ", markovify.__version__)
import pickle #So we can save any of our output variables for later use
import json #So we can save any of our output items for later use

Collecting markovify
  Downloading https://files.pythonhosted.org/packages/de/c3/2e017f687e47e88eb9d8adf970527e2299fb566eba62112c2851ebb7ab93/markovify-0.8.0.tar.gz
Collecting unidecode
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
     |████████████████████████████████| 245kB 11.6MB/s 
Building wheels for collected packages: markovify
  Building wheel for markovify (setup.py) ... done
  Created wheel for markovify: filename=markovify-0.8.0-cp36-none-any.whl size=10694 sha256=af6345a0af2ff28bfb4ad1a80a07d78011a0b3c858614305a844e9bce2842fc2
  Stored in directory: /root/.cache/pip/wheels/5d/a8/92/35e2df870ff15a65657679dca105d190ec3c854a9f75435e40
Successfully built markovify
Installing collected packages: unidecode, markovify
Successfully installed markovify-0.8.0 unidecode-1.1.1
Pandas imported as Version:  1.0.3
Markovify imported as Version

# Import your data 
First you need a CSV file containing the text you want to use to fine-tune your GPT-2 model and train your Markovify model.  I used English-language state-controlled media as my training/fine-tuning data.

Upload your CSV file to your Google Drive folder.  My file is named "English_Language-State_Media-10_15_to_11_15-2019.csv" and is stored at "/content/QDA_NLG_Detection/Data".
This is in the repo we cloned from GitHub.

Next, navigate to the folder where you stored your file.  My file is stored -> /content/drive/My Drive/Colab Notebooks/

In [0]:
%cd /content/QDA_NLG_Detection/Data/
%ls

/content/drive/My Drive/Colab Notebooks
 checkpoint/                                gpt_sputnik_text_data_5.json
'Copy of bookworm.ipynb'                    gpt_sputnik_text_data_6.json
'Copy of MST698S_CNN-exercise.ipynb'        markovify_text_stored.json
'Copy of sentiment_analysis.ipynb'         'Monday_Makeover (1).ipynb'
'Copy of text_processing.ipynb'             Monday_Makeover.ipynb
 Coursework/                                mutliprocessing_test.ipynb
 covid_19_test.ipynb                        NLP_Udacity.ipynb
 English_Language-State_Media-11-2019.csv   Research_Notes.ipynb
 gpt_sputnik_text_data_1.json               Sputnik_body_alltext.p
 gpt_sputnik_text_data_2.json               Thesis/
 gpt_sputnik_text_data_3.json               Thesis_Analysis-ALES.ipynb
 gpt_sputnik_text_data_4.json               Untitled0.ipynb


Data in `English_Language-State_Media-10_15_to_11_15-2019.csv` was collected and exported from an online news aggregation site. 

- [x] Import raw news data export (English_Language-State_Media-10_15_to_11_15-2019.csv)
- [x] Display the basic information about the data

In [0]:
data = pd.read_csv('English_Language-State_Media-10_15_to_11_15-2019.csv') #imports the file
data.columns = data.columns.str.replace(' ', '_') #cleans up the columns by replacing spaces with "_" - Note: Spaces are evil and their use in code is immoral :-P
#data.head() #This displays the first few rows of the data
data = data.drop(columns=['Unnamed:_0']) #drop the old index from the csv file
data.info() #This displays info on the data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28845 entries, 0 to 28844
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Title               28845 non-null  object
 1   Body                28793 non-null  object
 2   Source_Name         28845 non-null  object
 3   Source_Date,_Start  28845 non-null  object
 4   Creator             20224 non-null  object
 5   Keywords            4541 non-null   object
 6   Source_Medium       28845 non-null  object
dtypes: object(7)
memory usage: 1.5+ MB


## Show some more info on the dataframe

In [0]:
#show the Creator column of the dataframe
data.Creator

0                   RT
1        Lucas Jackson
2                   RT
3                   RT
4                   RT
             ...      
28840              NaN
28841              NaN
28842              NaN
28843              NaN
28844              NaN
Name: Creator, Length: 28845, dtype: object

In [0]:
#show information on each column
for column in data.columns.values: 
  print(data[column].value_counts())
  print('\n\n')


Xinhua photos of the day                                                              30
China-related news briefing                                                           25
What's trending worldwide                                                             21
[UNABLE TO COLLECT DUE TO SITE ERROR]                                                 12
Chinese shares close lower Wednesday                                                   8
                                                                                      ..
Nankai University celebrates 100th anniversary with glory from past and for future     1
Babysitter is Not Amused: Cat Teaches Corgi Puppies Good Manners                       1
Haiti's opposition rejects dialogue proposed by Washington                             1
White Cane Safety Day event held in Beirut, Lebanon                                    1
Trump attacks Dems, 'LameStream' media as impeachment nears public phase               1
Name: Title, Length: 

# Perform basic data cleaning/prep
The above info tells us that the data has 28,845 articles, but there is some weirdness in the values.  The number of non-null Body items (28,793) is less than the number of Title items (28,845), which means that some Body values are null (stored as NaN).  We don't want this, so we will replace those NaN values with an empty text string "".   

In [0]:
values = {'Body': ""} #create a dictionary with the columns that you want to search and the value you want to replace.
data = data.fillna(value=values) #Some of the body field is blank
data.info() #display the info to see if we fixed it

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28845 entries, 0 to 28844
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Title               28845 non-null  object
 1   Body                28845 non-null  object
 2   Source_Name         28845 non-null  object
 3   Source_Date,_Start  28845 non-null  object
 4   Creator             20224 non-null  object
 5   Keywords            4541 non-null   object
 6   Source_Medium       28845 non-null  object
dtypes: object(7)
memory usage: 1.5+ MB
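
The dictionary form of `fillna` used above only touches the columns named in the dictionary, which is why `Creator` and `Keywords` still show missing values. Here is a minimal sketch of that behavior on a made-up three-row frame (the `toy` data is an assumption for illustration, not part of the news export):

```python
import pandas as pd

# Toy stand-in for the news export (hypothetical values).
toy = pd.DataFrame({
    "Title": ["a", "b", "c"],
    "Body": ["some text", None, "more text"],
    "Keywords": [None, "x", None],
})

missing_before = int(toy["Body"].isna().sum())

# Passing a dict limits the fill to the named columns;
# Keywords is not listed, so its missing values are left alone.
filled = toy.fillna(value={"Body": ""})

missing_after = int(filled["Body"].isna().sum())
keywords_missing = int(filled["Keywords"].isna().sum())
print(missing_before, missing_after, keywords_missing)
```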


In [0]:
print(data.Source_Name.value_counts()) #Prints a count of the number of entries for each of the unique sources

Xinhua                                 12765
Sputnik                                 4320
China Daily Online (Global Edition)     3423
Global Times Online                     1981
RT Online                               1828
Fars News Agency                        1427
Prensa Latina                            875
Tasnim                                   798
Press TV                                 631
Telesur Online                           527
Cuban News Agency                        216
Granma Online                             54
Name: Source_Name, dtype: int64


That is a lot of news articles. I want to "write" articles in the style of this media, but I do not have the time to read all of them, so let's do some data science!

First we will parse out the Sputnik news articles and store all the text as one long string.  We will start each article with "\<START_TEXT\>" and end it with "\<END_TEXT\>", for easy future parsing.

In [0]:
source_selection = 'Sputnik' #starting with Sputnik as a test case, if this works the code can be modified to do this for each source
selected_data = data.loc[data['Source_Name'] == source_selection]
body_alltext = ""
for index, row in selected_data.iterrows(): #Note: iterating over a dataframe is strongly not recommended
    body_alltext += "<START_TEXT>" 
    body_alltext += row['Body']
    body_alltext += "<END_TEXT>" 
    #body_alltext += "\n\n" 
    
#check to see if all the articles were combined correctly
if data.Source_Name.value_counts()[source_selection] == body_alltext.count('<END_TEXT>'): 
    print(f"number of articles combined is: {body_alltext.count('<END_TEXT>')}")
else:
    print(f"Something went wrong, we combined {body_alltext.count('<END_TEXT>')} out of {data.Source_Name.value_counts()[source_selection]} articles")

number of articles combined is: 4320
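
As the comment in the loop notes, iterating over a dataframe row by row is discouraged. A possible alternative, sketched here on a made-up three-row frame, builds the same marked-up string with `str.join` and then shows how the markers allow the articles to be split apart again. The `toy` frame and the `START`/`END` names are assumptions for illustration only.

```python
import re
import pandas as pd

START, END = "<START_TEXT>", "<END_TEXT>"

# Toy stand-in for the news data (hypothetical values).
toy = pd.DataFrame({
    "Source_Name": ["Sputnik", "Xinhua", "Sputnik"],
    "Body": ["First article.", "Other source.", "Second article."],
})

# Select one source and wrap each body in the markers, without iterrows().
bodies = toy.loc[toy["Source_Name"] == "Sputnik", "Body"]
combined = "".join(f"{START}{body}{END}" for body in bodies)

# Recover the individual articles; DOTALL lets a body span several lines.
pattern = re.escape(START) + r"(.*?)" + re.escape(END)
articles = re.findall(pattern, combined, flags=re.DOTALL)
print(len(articles))
```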


Dump the output text to a CSV file and a pickle file for future use.  Both were used to simplify future access.

In [0]:
selected_data['Body'].to_csv('Sputnik_body_alltext.csv', index=False)  #save the parsed data to a csv file





In [0]:
with open("Sputnik_body_alltext.p", "wb") as f: #the with-block closes the file once the dump finishes
    pickle.dump(body_alltext, f)

# Train the Markovify Model
Now let's train a Markovify model on the Sputnik text to make a Russian news bot.

First we will load the data from the pickle file we saved.

In [0]:
%cd /content/drive/'My Drive'/'Colab Notebooks'
with open("Sputnik_body_alltext.p", "rb") as f: #the with-block closes the file once the load finishes
    body_alltext = pickle.load(f)

In [0]:
sputnik_model = markovify.Text(body_alltext)
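
To give a sense of what a Markov chain text model does, here is a minimal word-level sketch in plain Python. It is not Markovify's implementation, which also handles sentence splitting and checks generated sentences against the source text. The function names `build_chain` and `generate` and the tiny `corpus` are hypothetical. The state size of 2 matches Markovify's default.

```python
import random
from collections import defaultdict

def build_chain(text, state_size=2):
    """Map each tuple of `state_size` consecutive words to the words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        chain[state].append(words[i + state_size])
    return dict(chain)

def generate(chain, start, max_words=20, seed=None):
    """Walk the chain from `start`, picking a random observed follower at each step."""
    rng = random.Random(seed)
    state = tuple(start)
    output = list(state)
    # Stop at the word limit or when the current state was never followed by anything.
    while len(output) < max_words and state in chain:
        next_word = rng.choice(chain[state])
        output.append(next_word)
        state = state[1:] + (next_word,)
    return " ".join(output)

corpus = "the cat sat on the mat and the cat ran to the door"
chain = build_chain(corpus)
print(generate(chain, ("the", "cat"), seed=0))
```

Every step only looks at the last two words, which is why the output tends to be locally plausible but drifts in topic over a longer span.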

# Generate text with the trained Markovify model
Now let's generate some news.

In [0]:
markovify_text_stored = []
for i in range(0, 12500):
    markovify_text_stored.append(sputnik_model.make_sentence())

%cd /content/drive/'My Drive'/'Colab Notebooks'
with open('markovify_text_stored.json', 'w', encoding='utf-8') as f:
   json.dump(markovify_text_stored, f, ensure_ascii=False, indent=4)
%ls

/content/drive/My Drive/Colab Notebooks
'698R - Data Science Math'/
 checkpoint/
'Copy of Train a GPT-2 Text-Generating Model w  GPU'
 English_Language-State_Controlled_Media-11-2019.csv
 markovify_text_stored.json
 MST698O-Intro_Data_Science/
 MST698R-Data_Science_Math/
 MST-698R_Project_Ales.ipynb
 Research_Notes.ipynb
 Sputnik_body_alltext.p
 Thesis/


In [0]:
#Open the data we generated with markovify
with open("markovify_text_stored.json", "r", encoding="utf-8") as markovify_read_file: #match the utf-8 encoding used when the file was written
    markovify_text_data = json.load(markovify_read_file)
print("example Markovify content:\n")
for n in range(70, 80):
    print(markovify_text_data[n])

example Markovify content:

The material which purports to support inclusiveness had been downloaded to a man - allegedly the murderous regime of Nicolas Dupont-Aignan, the Belgian People's Party described the US has been confined to a statement by Saudi Crown Prince Mohammed bin Zayed Al Nahyan, and the manner in the Southeast Asian nations with many mocking the people outside the Swedish parliament in Stockholm in 2018.
Mexico, where he was forced to ask Drake for clarification.
Are women equally as capable of carrying out a simulated bombing of Nagasaki the only way tackle this problem seriously and Misawa has suspended diplomatic, trade and diplomatic row for several years.
According to the global office of Prime Minister Imran Khan should be free of charge, if they had registered 4,628 complaints, of which DC is a man was rescued from a standard precautionary measure due to the oilfields area.
Gwadar Master Plan was also taken to prison reform, and drafting immigration policy.
“In
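
One caveat about the stored list: Markovify's `make_sentence()` returns `None` when it cannot build an acceptable sentence within its allotted tries, so a fixed-count loop can store `None` entries alongside the text. A minimal sketch of one way to collect only real sentences follows. The helper `collect_sentences` and the stand-in generator `fake_make_sentence` are hypothetical names introduced for illustration.

```python
def collect_sentences(make_sentence, target, max_attempts=None):
    """Call `make_sentence` until `target` non-None results are collected.

    Gives up after `max_attempts` calls (default: 10 * target) so a
    generator that keeps failing cannot loop forever.
    """
    if max_attempts is None:
        max_attempts = 10 * target
    sentences = []
    attempts = 0
    while len(sentences) < target and attempts < max_attempts:
        attempts += 1
        sentence = make_sentence()
        if sentence is not None:
            sentences.append(sentence)
    return sentences

# Stand-in generator that fails on every third call, to exercise the helper.
calls = {"n": 0}

def fake_make_sentence():
    calls["n"] += 1
    return None if calls["n"] % 3 == 0 else f"sentence {calls['n']}"

result = collect_sentences(fake_make_sentence, target=5)
print(len(result))
```

With the trained model, the equivalent call would be `collect_sentences(sputnik_model.make_sentence, 12500)`.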

Awesome, now we have a trained Markov model. However, as you can see, even though the system generates mostly grammatically correct sentences, they are incoherent when joined together.
So, let's build a system that can generate long coherent sentences. [Enter the GPT-2 fine-tuning notebook](https://colab.research.google.com/drive/1Hs30ZifOvO6T4WSDVWS7H7LaxFmzv-ER#scrollTo=4RNY6RBI9LmL)


# License Information
MIT License

Copyright (c) 2020 Barney Ales

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.