# Text Data Cleaning Introduction

## Overview of Topics Covered:

## Main Notebooks:
### PPW0 - Text Data Cleaning Introduction

* Python strings
* Built-in string methods
* Type conversions
* Functions and order of operations
* Pandas `.apply()` and `lambda`
* Filtering rows in Pandas

### PPW1 - Ask A Manager - Salary Survey

* *Messy* Data from Google Forms surveys
* ...Like, *really* messy
* CSV vs. Excel files in Pandas
* Renaming columns in a Pandas DataFrame
* Converting strings to integers
* String methods in action
* Manipulating data with `.apply()` and `lambda`
* Filtering rows in Pandas
* Using data from multiple DataFrame columns
* Data preprocessing/integration/enrichment
* `datetime` and `dateutil` (for more, see PPW2 - Doctor Who)
* Pandas `.merge()` to join datasets

## Bonus Notebooks:
### PPW2 - Doctor Who - Actor Timeline

* Extracting text data from a Wikipedia table
* Pandas `.read_html()`
* Regular expressions (Python `re` module)
* Getting time series data out of text data
* Python's `datetime` module and the `dateutil` package
* Timeline visualization

### PPW3 - Goodreads - Book Ratings

* Reading in a badly-formatted dataset
* Working with `bytes` objects
* Cleaning *before* loading into pandas

### PPW4 - Behavioral Risk Factor Surveillance System (BRFSS) 2014

* Extracting text data from PDF files using `pdfminer.six`
* Cleaning up excess whitespace in strings
* Using a dictionary to replace values in a column

### PPW5 - Odds & Ends

* Syntax updates in Pandas

# Python Strings

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## What are Strings?

Strings are basic units of text. They can contain any character. Python recognizes input as a string when it's enclosed in matching quotation marks.

Strings can be combined using the "+" operator. Say we have two string variables that denote the start date and end date of a process, and we want to print them out in a coherent sentence. We can do so like this:

In [None]:
start_date = '2024-05-01'
end_date = '2024-05-31'

#Notice the spaces around "to".
full_string = start_date + " to " + end_date

print(full_string)

### Coding Exercise: "+" in Different Contexts

In [None]:
#Strings
string_var1 = 'book'
string_var2 = 'shelf'

#Integers
int_var1 = 1
int_var2 = 2

#Concatenate strings
print(string_var1 + string_var2)

#Add integers
print(int_var1 + int_var2)

################################################################################
################################################################################

# What happens if you use "+" with string_var1 and int_var1?
string_var1 + int_var1

################################################################################
################################################################################

In the `full_string` example above, each string is enclosed in quotation marks. The dates use single-quotes, while the `" to "` uses double-quotes. The two types of quotation marks work the same way, but they can't be mixed within one string: a string must be closed with the same type of quotation mark that opened it.

### Coding Exercise: Strings and Quotation Marks

Try replacing the double-quotes around the string with single-quotes.

In [None]:
################################################################################
################################################################################

string2 = "This doesn't work unless you use double-quotes to open and close the string."
print(string2)

################################################################################
################################################################################

This is helpful when you want your string to include quotation marks: enclose the string in the type of quotation mark that it doesn't contain. For strings with both single- and double-quotes, we have to make use of a different character, the backslash: `\`

### Escape Characters

"`\`" is called an "escape character" in Python (and in markdown cells). Placing an escape character before another character in a string changes how that character is interpreted. In Python strings, "\\t" represents a tab, and "\\n" represents a newline character.

Also, in order for a "\\" to show up correctly in markdown cells\*, it has to have another \\ in front of it. "\`" (grave) is another special character that functions as an escape character in some contexts in markdown, but not in code cells.

\**(Double-click in this markdown cell to see how many backslashes and graves there actually are in the markdown text!)*

In [None]:
backslash_str = "The backslash (\\) allows you to use both 'single' and \"double\" quotes in the same string."
print(backslash_str)

In [None]:
print('line 1\n\n\n\n\n\n\tline 2 (tab-indented)')
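
As a small illustrative sketch (the variable names `escaped` and `quote_demo` are made up for this example), the cell below shows that an escape sequence is stored as a single character, and that `repr()` displays the escape sequences instead of rendering them:

```python
# Each escape sequence counts as one character in the string.
escaped = 'a\tb\nc'
print(len(escaped))     # 5 characters: a, tab, b, newline, c
print(repr(escaped))    # repr() shows \t and \n rather than rendering them

# Only the quote type that encloses the string needs a backslash.
quote_demo = 'It\'s "fine"'
print(quote_demo)
```

`repr()` can be handy during data cleaning, because it makes otherwise invisible tabs and newlines easy to spot.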

## Indexing & Slicing

When you think of container objects in Python, you probably think first of lists and dictionaries (and maybe sets and tuples, depending on how far down the Python rabbit-hole you are).

But strings are also container objects!

A string is a container that holds individual characters in a specific order, much like the elements of a Python list. As such, strings may be indexed in much the same way as lists can be.

As with lists, indices go inside square brackets. A single number yields the character at that index (counting from 0, not 1), while two numbers separated by a colon yield a slice: the characters from the first index up to (but not including) the second. For example, an index of `[0:3]` returns the characters at indices 0, 1, and 2, *but not 3*. Leaving out one of the two numbers extends the slice to that end of the string.

A negative number in either position counts backward from the end of the string rather than forward from the beginning, so `-1` refers to the last character.

In [None]:
print(start_date)
print(start_date[0])
print(start_date[0:4])
print(start_date[-5:])
print(start_date[-5:-3])

A third number passed in the square brackets sets a "stride" (or step): with a stride of `n`, only every `n`th character in the slice is returned. A negative stride steps backward through the string, so the characters come back in reverse order.

In [None]:
print(start_date[0:4:2])
print(start_date[3::-1])
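
Slicing is often enough to pull fixed-width fields out of a string. Here is a minimal sketch, assuming a date that always follows the `YYYY-MM-DD` layout (`iso_date` is a made-up variable for this example):

```python
iso_date = '2024-05-01'   # example value in YYYY-MM-DD form

year = iso_date[0:4]      # characters at indices 0 through 3
month = iso_date[5:7]     # characters at indices 5 and 6
day = iso_date[-2:]       # the last two characters

print(year, month, day)
print(iso_date[::-1])     # a stride of -1 reverses the whole string
```

This only works when every value has the same layout; messier data usually calls for `.split()` or the date tools covered in the later notebooks.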

## String Methods

String objects in Python have built-in functions called "methods" that perform specific operations without the need to write additional code. There are over 40 string methods, each with its own specific task.

Like all methods, these are called by writing the string variable, then the dot operator, the name of the method, and a pair of parentheses, i.e.: `string_name.method()`.

https://www.w3schools.com/python/python_ref_string.asp

In [None]:
string_var3 = "My parents and I moved back to California from Berlin in September, 1989. \nI had just turned three years old."

### Case manipulation
Several string methods deal with changing the case of one or more characters in the string. `.upper()` and `.lower()` are the most commonly used of these.

In [None]:
print(string_var3)

In [None]:
print(string_var3.upper())

In [None]:
print(string_var3.lower())

In [None]:
print(string_var3.capitalize())

In [None]:
print(string_var3.title())

In [None]:
print(string_var3.swapcase())

### Properties
Some string methods allow you to determine the properties of a string.

In [None]:
print(string_var3.isnumeric())

In [None]:
print(string_var3.isalpha())

In [None]:
print(string_var3.count('i'))
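
The property checks above are easier to interpret on short strings. The sketch below uses made-up sample values to show that `.isnumeric()` and `.isalpha()` return `True` only when *every* character qualifies, and that `.count()` is case-sensitive:

```python
print('2024'.isnumeric())          # True: every character is a digit
print('3.8'.isnumeric())           # False: the decimal point is not numeric
print('Berlin'.isalpha())          # True: letters only
print('New York'.isalpha())        # False: the space is not a letter

print('Mississippi'.count('ss'))   # 2
print('Mississippi'.count('S'))    # 0: no upper-case S in the string
```

This is why a full sentence with spaces and punctuation fails both checks.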

### Segmentation
Splitting strings on a particular value can be very important to data cleaning.

In [None]:
print(string_var3.split())

In [None]:
print(string_var3.split(', '))

In [None]:
print(string_var3.splitlines())

In [None]:
print(string_var3.partition('1989'))
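
To make the difference between `.split()` and `.partition()` concrete, here is a short sketch on a made-up value (`record` is not part of the original notebook). `.split()` returns a list and accepts an optional maximum number of splits, while `.partition()` always returns a 3-item tuple:

```python
record = 'Berlin, Germany, 1989'   # hypothetical example value

all_parts = record.split(', ')      # split at every delimiter
first_split = record.split(', ', 1) # split at most once
parts = record.partition(', ')      # (before, separator, after)
missing = record.partition('; ')    # separator not found

print(all_parts)
print(first_split)
print(parts)
print(missing)
```

When the separator is missing, `.partition()` returns the whole string followed by two empty strings, which makes it safe to unpack into three variables.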

### Replacing Values
`.replace()` is perhaps the most important string method for data cleaning, because of its versatility and specificity.

In [None]:
print(string_var3.replace('1989', '2015'))

In [None]:
print(string_var3.replace(' ', ''))
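
Because each string method returns a new string, calls can be chained. The following is a rough sketch of a typical cleaning step, assuming a made-up salary value (`raw_salary`) with stray whitespace, a currency symbol, and a thousands separator; it also uses `.strip()`, another built-in string method that removes leading and trailing whitespace:

```python
raw_salary = ' $52,000 '   # hypothetical messy value

# Remove outer whitespace, then the dollar sign, then the comma.
cleaned = raw_salary.strip().replace('$', '').replace(',', '')
print(cleaned)             # still a string
print(int(cleaned))        # now an integer

# An optional third argument limits how many replacements are made.
first_only = 'a-b-c'.replace('-', '+', 1)
print(first_only)
```

The original `raw_salary` is left unchanged; the cleaned value has to be assigned to a variable to be kept.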

### Coding Exercise: String Methods

Type your own sentence in the following cell, then try out different string methods on it to see how they affect the string. What do you notice about their behavior?

In [None]:
################################################################################
################################################################################

string_var4 = "___"
print(string_var4)

################################################################################
################################################################################

## Converting Types

`str()` lets you change an integer or a float into a string. Python also has `int()` and `float()` for converting to integer and floating point data types, respectively.

`str()` will take whatever is passed to it and convert it literally to a string. `int()` and `float()` are much more particular about their inputs: they raise a `ValueError` for strings that don't look like numbers, and `int()` truncates a float toward zero rather than rounding it.

In [None]:
print(int('3'))
print(float('3'))
print(int(3.0))
print(int(3.8))
print(float(3))
print(str(3.8))
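
One case that often comes up in data cleaning is a string that holds a decimal number. The sketch below (variable names are made up for illustration) shows that `int()` rejects such a string, and that converting through `float()` first is one way around it:

```python
# int() rejects a string that contains a decimal point.
try:
    int('3.8')
    error_raised = False
except ValueError as err:
    error_raised = True
    print('ValueError:', err)

truncated = int(float('3.8'))   # convert to float first, then truncate
rounded = round(float('3.8'))   # round to the nearest integer instead
print(truncated, rounded)
```

Whether to truncate or round depends on what the numbers mean, so it is worth making that choice explicit.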

## Functions

Here are a few functions that will let us manipulate strings.

https://docs.python.org/3.5/tutorial/controlflow.html#documentation-strings

In [None]:
def separate_words(sample_string: str, delimiter=' ') -> list:
    """Split a string into a list of substrings on the delimiter."""
    words = sample_string.split(delimiter)
    return words

def add_elipses(sample_string: str) -> str:
    """Append an ellipsis ('...') to the end of a string."""
    return sample_string + '...'

def join_words(sample_list: list, delimiter=' ') -> str:
    """Join the items of a list into one string, separated by the delimiter."""
    title = delimiter.join(sample_list)
    return title

input_string = 'Please speak more slowly'
join_words([add_elipses(word) for word in separate_words(input_string)])

### Coding Exercise: Functions as an Assembly Line

Try using each function (`separate_words()`, `add_elipses()`, `join_words()`) by itself, using the same input string. Combine more than one function, and change the order in which the functions are applied.


In [None]:
################################################################################
################################################################################

join_words(input_string)

################################################################################
################################################################################

# Intro to Pandas

Pandas (from "panel data") is a Python library (a collection of modules) that extends Python's basic capabilities by adding support for tabular data.

Pandas's most-used class is the DataFrame. DataFrames are specialized container objects that store data in named columns. Column names should be unique (Pandas permits duplicates, but they make column selection ambiguous), and all columns must be the same length. A DataFrame is like a spreadsheet, but you don't click in cells and type into it; you use functions and methods to manipulate the data it contains.

Though individual cells are more difficult to edit in a DataFrame than in an application like Microsoft Excel or Google Sheets, the platform lends itself well to making sweeping edits quickly. This is a huge advantage when you want to clean data or to engineer new features in a dataset.

Pandas has built-in statistical methods, such as `.describe()` and `.mean()`, so it's easy to get summary statistics for an entire dataset.

Pandas is also capable of [reading data from and writing data to external files](https://pandas.pydata.org/pandas-docs/stable/reference/io.html) in a wide variety of formats, from `.csv` files to `.json` to `.xlsx` (Excel workbooks) and others.

Let's make a very simple DataFrame to illustrate how we can work with text data in a tabular format in Pandas.

In [None]:
fruit_list = ['apple', 'banana', 'cherimoya', 'durian']
qty_list = [2, 5, 2, 1]

#There are many ways to construct a DataFrame. 
fruit_df = pd.DataFrame({'fruit':fruit_list, 'qty':qty_list})

In [None]:
fruit_df
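
As one example of another construction route, a DataFrame can also be built from a list of dictionaries, one per row. The sketch below rebuilds the same fruit data that way (`records` and `records_df` are names made up for this example) and then asks for summary statistics on the numeric column:

```python
import pandas as pd

# One dictionary per row; the keys become the column names.
records = [
    {'fruit': 'apple', 'qty': 2},
    {'fruit': 'banana', 'qty': 5},
    {'fruit': 'cherimoya', 'qty': 2},
    {'fruit': 'durian', 'qty': 1},
]
records_df = pd.DataFrame(records)
print(records_df)

# Summary statistics (count, mean, min, max, ...) for the 'qty' column.
qty_summary = records_df['qty'].describe()
print(qty_summary)
```

The row-per-dictionary form is convenient when records arrive one at a time, for instance from a survey response or an API.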

## Filtering Rows in Pandas

If you've used Microsoft Excel, Google Sheets, or other spreadsheet applications, you may have had to apply a filter to isolate a subset of records on a sheet. Pandas also has that capability, but you don't use a dropdown menu to do it - you write the filter in a line of code instead.

If we want to look only at rows that match a certain condition in a DataFrame, we can make use of Boolean comparison operators, `==`, `!=`, `>=`, `<=`, `>`, *and* `<`. These can be used to create a Pandas `Series` object that contains Boolean values for each item in a column.

In [None]:
fruit_df['qty'] == 2

In [None]:
type(fruit_df['qty'] == 2)

In [None]:
fruit_df[fruit_df['qty'] == 2]

We can then use that Series to filter out rows that have a `False` value. If we want to look at the rows of `fruit_df` where "qty" is 2, we can apply a filter by placing that Series object inside square brackets after the name of the DataFrame.

This may look a little weird with the name of the DataFrame showing twice, but it makes sense if you read it as "show only the rows of the dataframe where the values in the dataframe's qty column are equal to 2".

We'll explore this feature in greater detail in the `PPW1 - ManagerSalary` notebook.
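
The other comparison operators work the same way, and conditions can be combined. Here is a minimal sketch on a stand-alone copy of the fruit data (`demo_df` and the result names are made up for this example). Note that Pandas uses `&` for "and" and `|` for "or", and that each condition needs its own parentheses:

```python
import pandas as pd

demo_df = pd.DataFrame({
    'fruit': ['apple', 'banana', 'cherimoya', 'durian'],
    'qty': [2, 5, 2, 1],
})

# Rows where qty is 2 or more.
at_least_two = demo_df[demo_df['qty'] >= 2]
print(at_least_two)

# Rows where qty is exactly 2 AND the fruit is not 'apple'.
two_not_apple = demo_df[(demo_df['qty'] == 2) & (demo_df['fruit'] != 'apple')]
print(two_not_apple)
```

Filtering returns a new DataFrame and keeps the original row labels, which is why the index in the output may skip numbers.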

## EXAMPLE: Cleaning Data using `.apply()` and `lambda`

A quick way to create new columns or modify existing ones in a DataFrame is to call the `.apply()` method and pass it a function, often (but not always) written with the `lambda` keyword. The `.apply()` used here is a method of the `pd.Series` class.

A lambda function is an "anonymous" function (one that doesn't need a name because it's used only in a local context). "Lambda" can mean different things in different programming languages, but in Python it lets you write a quick ad hoc function without formally defining it with a `def` statement.

If a function's parameters suit the data in the Pandas Series, we can pass the function itself as the argument of `.apply()`, with no need for a lambda expression. Here, `join_words()` receives each fruit name as a string rather than a list, so it joins the individual characters with spaces.

In [None]:
fruit_array2 = fruit_df['fruit'].apply(join_words)

fruit_array2

Using a lambda function allows us to apply a string method to every value in a DataFrame column (assuming all the values are strings).

In [None]:
fruit_array1 = fruit_df['fruit'].apply(lambda x: x.upper())

fruit_array1

In [None]:
fruit_array3 = fruit_df['fruit'].apply(lambda x: join_words(x.upper()))

fruit_array3
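
To show how these options relate, the sketch below applies the same transformation three ways: with a named function, with a lambda, and with the Series `.str` accessor, which exposes many string methods directly. The helper `shout()` and the other names are made up for this example.

```python
import pandas as pd

fruits = pd.Series(['apple', 'banana', 'cherimoya', 'durian'])

def shout(text: str) -> str:
    """Hypothetical helper: upper-case a string and add an exclamation mark."""
    return text.upper() + '!'

with_def = fruits.apply(shout)                          # named function
with_lambda = fruits.apply(lambda x: x.upper() + '!')   # same logic, inline
with_str = fruits.str.upper() + '!'                     # .str accessor

print(with_def)
print(list(with_def) == list(with_lambda) == list(with_str))
```

A named function is easier to reuse and test, while a lambda keeps short, one-off logic next to the column it modifies.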

# End of Introduction

That's not everything there is to know about strings in Python, but it's a start!

Let's get to some concrete examples of text data-cleaning in action!