# Intro to Text Mining with Python
In this workshop, we will use the NLTK library to walk you through some basic steps of a text mining project. NLTK is one of the most popular libraries for working with human language data.

"Text mining, also referred to as text analysis, is the process of obtaining meaningful information from large collections of unstructured data. By automatically identifying patterns, topics, and relevant keywords, text mining uncovers relevant insights that can help you answer specific questions." -monkeylearn.com

Some basic steps of text mining we are going to demonstrate include:

       -Parsing
       -Comparison Methods
       -String Operations

These basic steps can be utilized later on in:

       -Complex Parsing
       -Text Mining Techniques:
          -Information Extraction
          -Information Retrieval
          -Categorization
          -Clustering
          -Summarization

In [None]:
#Note: If you run this notebook in Jupyter instead of Google Colab, you may need to install the libraries first from your command prompt (e.g. pip install nltk pandas)

from pathlib import Path #provides an object api for working with files and directories
import pandas as pd #library used for data science and machine learning
import os #provides functions for interacting with operating systems
import glob #used to return all file paths that match a specific pattern
import sys #provides functions and variables used to manipulate different parts of the Python runtime environment

### Import files

In order to import a folder of files, we use the os.chdir function to first navigate to the right directory.

Then we use the glob.glob function to iterate through all the files that match a pattern.

In [None]:
my_dir = "Sample_data"
os.chdir(my_dir)   #change the current working directory to specified path. 

In [None]:
reviewList = []
#code through here
for files in glob.glob("*.txt"):   #glob.glob returns a list of path names; here it lets us loop through every .txt file in the sample folder
    df = pd.read_csv(files) #read the file into a DataFrame, a data structure that organizes data into a 2-dimensional table of rows and columns, like a spreadsheet
    #print(df)
    for content in df:  #iterating over a DataFrame yields its column labels, which read_csv takes from the first line of the file
        reviewList.append(content) #add each label (here, the text on the first line of the file, split at any commas) to this list
print(reviewList) #see the list of strings collected from all the .txt files
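
The same idea can be sketched without the sample folder. This is only an illustration under stated assumptions: the folder, file names, and review text below are made up, each file is read with plain open() rather than pd.read_csv, and the folder is passed inside the glob pattern instead of calling os.chdir.

```python
import glob
import os
import tempfile

# Build a throwaway folder with two small .txt files (hypothetical sample data).
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, text in [("a.txt", "Great movie!"), ("b.txt", "Not my favorite.")]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as handle:
            handle.write(text)

    # glob.glob accepts a full path pattern, so no os.chdir call is needed here.
    sample_reviews = []
    for path in sorted(glob.glob(os.path.join(tmp_dir, "*.txt"))):
        with open(path, encoding="utf-8") as handle:
            sample_reviews.append(handle.read())

print(sample_reviews)  # ['Great movie!', 'Not my favorite.']
```

Reading the whole file as text keeps commas and line breaks intact, which may or may not be what you want for your own data.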

Convert the review list into a huge string.

In [None]:
str1 = " " #separator used to combine all the strings in reviewList into 1 huge string; we want it as a bag of words
data = str1.join(reviewList) #combines all the strings; data is a single string
#this way we no longer need a loop to apply the same operation to each separate string in reviewList
data = data.replace("<br />","") #deletes any HTML line-break tags (<br />)
print (data)

## Parsing

Text parsing is the task of separating the text we want to analyze into smaller components based on some rules. It is very common in applications such as document parsing and NLP.

### Tokenization
Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens.

### Remove Punctuation and Stop Words
Stop Words are words that are so commonly used that they carry very little useful information.

http://www.nltk.org/nltk_data/

In [None]:
import nltk
# nltk.download_shell() for mac users
from nltk.tokenize import word_tokenize #word_tokenize splits a string into individual words called tokens

In [None]:
nltk.download('punkt')
#Your Code Here

In [None]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
#Your Code Here
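
To see what the `\w+` pattern does to a sentence, here is a minimal sketch. It uses the standard-library re module instead of NLTK so that it runs without any downloads; we expect RegexpTokenizer(r'\w+').tokenize(sample_text) to give the same tokens, but treat that as an assumption to check in your own notebook. The sample sentence is made up.

```python
import re

sample_text = "Hello, world! Text mining isn't hard."

# Keep only runs of letters, digits, and underscores; punctuation is dropped.
sample_tokens = re.findall(r"\w+", sample_text)
print(sample_tokens)  # ['Hello', 'world', 'Text', 'mining', 'isn', 't', 'hard']
```

Notice that the apostrophe splits "isn't" into two tokens, which is one reason to inspect tokens before counting them.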

In [None]:
from nltk.corpus import stopwords 
#stopwords are words that can be safely ignored, they don't add much meaning to a sentence outside of grammar

In [None]:
nltk.download('stopwords')

stop_words = set(stopwords.words('english')) #use the English stop word list

#Your Code Here
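
One possible way to filter stop words is a list comprehension. The sketch below is an illustration only: sample_tokens and sample_stop_words are hypothetical names, and the stop word set is a tiny hand-picked one, not NLTK's English list.

```python
sample_tokens = ["This", "movie", "is", "not", "the", "best", "movie"]

# Tiny hand-picked set for illustration; NLTK's English list is much longer.
sample_stop_words = {"this", "is", "the", "not"}

# Lowercase each token before the lookup so "This" matches "this".
sample_filtered = [w for w in sample_tokens if w.lower() not in sample_stop_words]
print(sample_filtered)  # ['movie', 'best', 'movie']
```

Storing the stop words in a set keeps each membership check fast, which matters once the token list gets long.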

In [None]:
len(filtered_words)

In [None]:
print(set(filtered_words))

In [None]:
len(set(filtered_words))

In [None]:
len(tokens)

In [None]:
print(set(tokens))

In [None]:
len(set(tokens))

In [None]:
print(set(w.lower() for w in tokens))

In [None]:
len(set(w.lower() for w in tokens))

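startswith() returns True if a string begins with the given prefix, and endswith() returns True if it ends with the given suffix. Both checks are case-sensitive.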


In [None]:
print(sorted([w for w in tokens if w.startswith('A')]))

In [None]:
print(sorted([w for w in tokens if w.endswith('a')]))

## Comparison Methods

### isupper()
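isupper() returns True if the string contains at least one letter and all of its letters are uppercase. Non-letter characters such as punctuation and spaces are ignored.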

In [None]:
#isupper()
str1 = "THWG"
str2 = "THWG!"
str3 = "TO HELL WITH GEORGIA!"
str4 = "To hell with Georgia!"
str5 = "to hell with georgia!"
print(str1.isupper())
print(str2.isupper())
print(str3.isupper())
print(str4.isupper())
print(str5.isupper())

### islower()
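islower() returns True if the string contains at least one letter and all of its letters are lowercase. Non-letter characters such as punctuation and spaces are ignored.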

In [None]:
#islower()
print(str1.islower())
print(str2.islower())
print(str3.islower())
print(str4.islower())
print(str5.islower())

### istitle()
istitle() returns True if every word starts with an uppercase letter and all of its remaining letters are lowercase, making the string a title.

In [None]:
#istitle()
str6 = "To Hell the Georgia!"
str7 = "To Hell The Georgia!"
print(str3.istitle()) #str3 = "TO HELL WITH GEORGIA!"
print(str6.istitle())
print(str7.istitle())

### isalpha()
isalpha() returns True if the string is not empty and all of its characters are alphabetic. Spaces and punctuation are not alphabetic.

In [None]:
#isalpha()
str1 = "To Hell With Georgia"
str2 = "!"
str3 = "THWG"
print(str1.isalpha())
print(str2.isalpha())
print(str3.isalpha())

### isdigit()
isdigit() returns True if all the characters in the string are digits. Unicode digit characters such as superscripts (for example, ²) also count as digits, but fractions such as ½ do not.

In [None]:
#isdigit() returns True if all characters in the string are digits
#Unicode is the encoding standard that assigns each letter, digit, or symbol a unique numeric value
nstr1 = "\u0030" #0
nstr2 = "\u00B2" #²
nstr3 = "\u00BD" #½
nstr4 = '\u00B23455' #²3455
nstr5 = "ABC"
nstr6 = " "
nstr7 = "☻"
nstr8 = "1234567"
nstr9 = "½"
nstr10 = "12.34"

print(nstr1.isdigit())
print(nstr2.isdigit())
print(nstr3.isdigit())
print(nstr4.isdigit())
print(nstr5.isdigit())
print(nstr6.isdigit())
print(nstr7.isdigit())
print(nstr8.isdigit())
print(nstr9.isdigit())
print(nstr10.isdigit())

### isnumeric()
isnumeric() returns True if all characters in the string are numeric. Unicode numeric characters, including superscripts such as ² and fractions such as ½, count as numeric.

In [None]:
#isnumeric()
nstr1 = "\u0030" #0
nstr2 = "\u00B2" #²
nstr3 = "\u00BD" #½
nstr4 = '\u00B23455' #²3455
nstr5 = "ABC"
nstr6 = " "
nstr7 = "☻"
nstr8 = "1234567"
nstr9 = "½"
nstr10 = "12.34"

print(nstr1.isnumeric())
print(nstr2.isnumeric())
print(nstr3.isnumeric())
print(nstr4.isnumeric())
print(nstr5.isnumeric())
print(nstr6.isnumeric())
print(nstr7.isnumeric())
print(nstr8.isnumeric())
print(nstr9.isnumeric())
print(nstr10.isnumeric())

### isdecimal()
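isdecimal() returns True if all characters in the string are decimal characters, that is, characters that can be used to form base-10 numbers (such as 0 through 9). A string with a decimal point, such as "12.34", returns False.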

In [None]:
#isdecimal() returns True if all characters in the string are decimal characters
nstr1 = "\u0030" #0
nstr2 = "\u00B2" #²
nstr3 = "\u00BD" #½
nstr4 = '\u00B23455' #²3455
nstr5 = "ABC"
nstr6 = " "
nstr7 = "☻"
nstr8 = "1234567"
nstr9 = "½"
nstr10 = "12.34"

print(nstr1.isdecimal())
print(nstr2.isdecimal())
print(nstr3.isdecimal())
print(nstr4.isdecimal())
print(nstr5.isdecimal())
print(nstr6.isdecimal())
print(nstr7.isdecimal())
print(nstr8.isdecimal())
print(nstr9.isdecimal())
print(nstr10.isdecimal())
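
Since isdigit(), isnumeric(), and isdecimal() are easy to mix up, it can help to run all three on the same strings side by side. The sketch below does that; the names samples and results are ours, chosen for illustration.

```python
samples = ["0", "\u00B2", "\u00BD", "1234567", "12.34"]  # "0", "²", "½", ...

# Map each string to (isdecimal, isdigit, isnumeric).
results = {s: (s.isdecimal(), s.isdigit(), s.isnumeric()) for s in samples}

for s, flags in results.items():
    print(repr(s), flags)
# '0'       (True, True, True)
# '²'       (False, True, True)
# '½'       (False, False, True)
# '1234567' (True, True, True)
# '12.34'   (False, False, False)
```

Each method accepts everything the one before it accepts: decimal characters are digits, and digits are numeric.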

### isalnum()
isalnum() returns True if the string is not empty and all of its characters are alphanumeric (letters or numbers).

In [None]:
#isalnum() returns True if all characters in the string are alphanumeric
anstr1 = "ABC"
anstr2 = "THWG!"
anstr3 = "A Cheese Cat"
anstr4 = "\u0030"
anstr5 = "\u00B2"
anstr6 = "\u00BD"
anstr7 = "\u00B23455"
anstr8 = "☻"
anstr9 = "½"
anstr10 = "12.34"
anstr11 = "1234"

print(anstr1.isalnum())
print(anstr2.isalnum())
print(anstr3.isalnum())
print(anstr4.isalnum())
print(anstr5.isalnum())
print(anstr6.isalnum())
print(anstr7.isalnum())
print(anstr8.isalnum())
print(anstr9.isalnum())
print(anstr10.isalnum())
print(anstr11.isalnum())

## String Operations

### upper()
Earlier, we went over lower(), so now let's do upper(). upper() returns the uppercase string from the given string.

In [None]:
#We have already gone over .lower() earlier when tokenizing, now let's do .upper()
#Your Code Here
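
A minimal sketch of upper() next to lower(), using a made-up sample string:

```python
sample = "Georgia Tech"

upper_sample = sample.upper()
lower_sample = sample.lower()

print(upper_sample)  # GEORGIA TECH
print(lower_sample)  # georgia tech
print(sample)        # Georgia Tech (strings are immutable, so the original is unchanged)
```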

### title()
title() returns a title-cased copy of the string: the first letter of every word is capitalized and the remaining letters are lowercased.

In [None]:
#title()
strt = "georgia tech yellow jackets"
#Your Code Here
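
One way to try title() on the string above (the variable names here are ours):

```python
sample_title = "georgia tech yellow jackets".title()
print(sample_title)  # Georgia Tech Yellow Jackets

# Letters after the first one in each word are lowercased as well.
mixed_title = "GEORGIA tech".title()
print(mixed_title)  # Georgia Tech
```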

### titlecase()
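title() treats an apostrophe as a word boundary, so it also capitalizes the letter that follows it, as the example below shows (capitalize() only capitalizes the first character of the whole string). To fix that, we write our own titlecase() function.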

In [None]:
#titlecase()
strtc = "georgia tech's yellow jackets"
print(strtc.title())
print(strtc.capitalize())
#we have to define a function to fix this issue
#re library is used to check for given patterns in strings
import re
def titlecase(s):
  return re.sub(
        r"[A-Za-z]+('[A-Za-z]+)?",
        lambda word: word.group(0).capitalize(),
        s)
#Your Code Here
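
The sketch below repeats the titlecase() definition so that it runs on its own, then compares the three results. The variable names with_title, with_capitalize, and with_titlecase are ours.

```python
import re

def titlecase(s):
    # Treat a word plus an optional apostrophe suffix ("tech's") as one unit.
    return re.sub(
        r"[A-Za-z]+('[A-Za-z]+)?",
        lambda word: word.group(0).capitalize(),
        s)

sample = "georgia tech's yellow jackets"
with_title = sample.title()
with_capitalize = sample.capitalize()
with_titlecase = titlecase(sample)

print(with_title)       # Georgia Tech'S Yellow Jackets
print(with_capitalize)  # Georgia tech's yellow jackets
print(with_titlecase)   # Georgia Tech's Yellow Jackets
```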

### split()
split() returns a list of the words in a string, in order from left to right. By default it splits on whitespace, and you can pass a different separator.

In [None]:
#split()
#Your Code Here

In [None]:
#import re
#Your Code Here
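
As a sketch of both cells, the example below splits a made-up string with str.split() and then with re.split(), which accepts a pattern instead of a fixed separator. The pattern chosen here is just one possibility.

```python
import re

sample = "Georgia Tech, Yellow Jackets"

by_whitespace = sample.split()             # default: split on runs of whitespace
by_separator = sample.split(", ")          # split on an exact separator
by_pattern = re.split(r"[,\s]+", sample)   # split on commas and/or whitespace

print(by_whitespace)  # ['Georgia', 'Tech,', 'Yellow', 'Jackets']
print(by_separator)   # ['Georgia Tech', 'Yellow Jackets']
print(by_pattern)     # ['Georgia', 'Tech', 'Yellow', 'Jackets']
```

Note that the default split leaves the comma attached to "Tech,", while the pattern-based split removes it.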

### splitlines()
splitlines() splits a string at any line breaks and returns them in a list.

In [None]:
#splitlines()
splistr = "Georgia Tech\nYellow Jackets"
#Your Code Here
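
A possible solution for the string above, plus the optional keepends argument:

```python
splistr = "Georgia Tech\nYellow Jackets"

lines = splistr.splitlines()
lines_with_breaks = splistr.splitlines(keepends=True)  # keep the "\n" on each line

print(lines)              # ['Georgia Tech', 'Yellow Jackets']
print(lines_with_breaks)  # ['Georgia Tech\n', 'Yellow Jackets']
```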

### sub()
sub() (from the re module) returns a copy of a string in which every match of a pattern is replaced with another string.

In [None]:
#sub()
#import re
#Your Code Here
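
A short sketch of re.sub() on made-up strings. The two patterns are only examples of what you might want to replace.

```python
import re

# Collapse every run of whitespace into a single space.
collapsed = re.sub(r"\s+", " ", "Georgia    Tech\tYellow Jackets")
print(collapsed)  # Georgia Tech Yellow Jackets

# Replace every run of digits with a placeholder.
masked = re.sub(r"\d+", "#", "Room 101, Floor 3")
print(masked)  # Room #, Floor #
```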

### join()
join() returns a string that connects the items of a list of strings, or the characters of a string, with a separator. The separator is the string that join() is called on.

In [None]:
#join()
joinsplit = data.split()
#Your Code Here

joinstring = "abcdef"
#Your Code Here

joinstring2 = " abcdef "
#Your Code Here

joinList = ['ab','cd','ef']
#Your Code Here
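
Here is one way the join() exercises could look. The separators are our choice; any string works.

```python
joinList = ['ab', 'cd', 'ef']
joinstring = "abcdef"

spaced = " ".join(joinList)       # separator between list items
glued = "".join(joinList)         # empty separator glues items together
dashed = "-".join(joinstring)     # a string is joined character by character

print(spaced)  # ab cd ef
print(glued)   # abcdef
print(dashed)  # a-b-c-d-e-f
```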

### strip()
strip() removes any leading and trailing whitespace from a string and returns the result.

In [None]:
#strip()
stripstr = "         like          "
#Your Code Here
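
A possible solution, with lstrip() shown for contrast:

```python
stripstr = "         like          "

stripped = stripstr.strip()        # remove whitespace on both sides
left_stripped = stripstr.lstrip()  # remove whitespace on the left only

print(repr(stripped))       # 'like'
print(repr(left_stripped))  # 'like          '
```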

### rstrip()
rstrip() removes all trailing occurrences of the given characters from a string and returns the result.

In [None]:
#rstrip(values)
rstripstr = "cheese,,,,,ssqqqww....."

#Your Code Here
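
The argument to rstrip() is a set of characters, not a suffix: characters are removed from the right until one is found that is not in the set. A sketch on the string above, with a character set we picked for illustration:

```python
rstripstr = "cheese,,,,,ssqqqww....."

# Strip any trailing ',', '.', 'q', 's', or 'w'; stops at the final 'e'.
cleaned = rstripstr.rstrip(",.qsw")
print(cleaned)  # cheese

# With only "." in the set, stripping stops at the first non-period character.
dots_only = rstripstr.rstrip(".")
print(dots_only)  # cheese,,,,,ssqqqww
```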

### find()
find() returns the index of the first occurrence of a given substring in a string, or within a given range of that string. It returns -1 if the substring is not found.

In [None]:
#find(value, start, end)
#Your Code Here
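
A small sketch of find() on a made-up string:

```python
sample = "yellow jackets, yellow jackets"

first = sample.find("yellow")          # search the whole string
from_index_1 = sample.find("yellow", 1)  # start searching at index 1
missing = sample.find("bees")          # substring is not present

print(first)         # 0
print(from_index_1)  # 16
print(missing)       # -1
```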

### rfind()
rfind() returns the index of the last occurrence of a given substring in a string, or within a given range of that string. It returns -1 if the substring is not found.

In [None]:
#rfind(value, start, end)
#Your Code Here
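
A small sketch of rfind() on a made-up string, including the optional start and end arguments:

```python
sample = "yellow jackets, yellow jackets"

last = sample.rfind("yellow")               # last occurrence in the whole string
in_range = sample.rfind("yellow", 0, 10)    # only look at indexes 0 through 9
missing = sample.rfind("bees")              # substring is not present

print(last)      # 16
print(in_range)  # 0
print(missing)   # -1
```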

### replace()
replace() finds occurrences of a substring in a string, replaces them with a new value, and returns the resulting string.

In [None]:
#replace(ogvalue, newvalue)
#Your Code Here
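
A sketch of replace() on a made-up string. The optional third argument limits how many occurrences are replaced.

```python
sample = "yellow jackets, yellow jackets"

all_replaced = sample.replace("yellow", "gold")       # every occurrence
first_replaced = sample.replace("yellow", "gold", 1)  # only the first one

print(all_replaced)    # gold jackets, gold jackets
print(first_replaced)  # gold jackets, yellow jackets
```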

### search()
search() (from the re module) returns a match object if the pattern matches anywhere in the string, and None otherwise.

In [None]:
#search()
#import re
#Your Code Here
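
A minimal sketch of re.search() on a made-up string. Because the result is either a match object or None, it can be used directly in an if statement.

```python
import re

sample = "Georgia Tech"

found = re.search(r"Tech", sample)      # pattern occurs later in the string
not_found = re.search(r"Duke", sample)  # pattern does not occur

if found:
    print(found.group(0), found.span())  # Tech (8, 12)
print(not_found)  # None
```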

### match()
match() (from the re module) returns a match object if the pattern matches at the beginning of the string, and None otherwise.

In [None]:
#match()
#import re
#Your Code Here
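
A minimal sketch of re.match() on a made-up string, showing that a pattern found only in the middle of the string does not count:

```python
import re

sample = "Georgia Tech"

at_start = re.match(r"Georgia", sample)  # pattern is at the beginning
in_middle = re.match(r"Tech", sample)    # pattern exists, but not at the beginning

print(at_start.group(0))  # Georgia
print(in_middle)          # None
```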

Georgia Tech Data Visualization Lab 2022 - SR