<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Adapted by Sarah Connell, Sean Rogers, Dipa Desai, and Sarah Morrell from two notebooks created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/). See [here](https://ithaka.github.io/tdm-notebooks/book/all-notebooks.html) for the original versions. Some exercises were adapted from teaching notebooks created by Laura Nelson, University of British Columbia, and from [Python for Everybody](https://www.py4e.com/). Warm thanks to Kate Kryder, Data Analysis & Visualization Specialist at Northeastern University, for helping to develop these notebooks.<br />
___

# Jupyter basics

**Jupyter Notebooks combine interwoven text, data, and code, in a format that runs in a web browser.**
* 'Jupyter' = JUlia, PYthon, and R - but it's really language-agnostic
* Jupyter Notebooks let you run code immediately
* Jupyter Notebooks can connect to a server that has the right environment/dependencies to execute the code successfully.

We'll be using Notebooks hosted by the Google Colab platform, but you can also download [Anaconda Navigator](https://docs.anaconda.com/anaconda/navigator/index.html) to edit notebooks and write Python code directly on your own computers.

## Cells

Similar to the way an essay is composed of paragraphs, Jupyter notebooks are composed of cells. A cell is basically a container for a particular kind of content. There are two kinds of content in Jupyter notebooks:

1. Text Cells—These can contain text, images, video, and the other kinds of explanatory content you might find on a regular website. The cell you're reading right now is a text cell.
2. Code Cells—These can contain code written in a variety of languages.

A **code cell** can be distinguished from a **text cell** by the fact that it contains a pair of brackets on its left. Code cells in Google Colab also have a grey background.

In [None]:
# This is a code cell

A text cell provides information, but a code cell can be executed to perform an action—that is, these cells typically contain code that you can **run**.

That said, the code cell above does not contain any executable content, only a text comment. We can tell that the text in the code cell is a comment because it is prefixed by a ``#``. In Python, if a line is prefaced by a ``#`` then that line is a comment and will not be executed when the code is run. In a Google Colab code cell, comments are green.

When you are learning code, commenting is *essential*; you should add comments to explain to yourself what the code is doing, to mark any questions that you have, and to remind yourself where you left off in your work (an important rule of coding is that your future self will not remember *anything*).

Commenting is also a responsible practice for any code that you might produce and share in the future. Your public-facing comments should be written to explain how the code is expected to behave, point out the rationale behind your design decisions, and mark the places where a user might want to modify the code.

The comments you make as you are learning can be used to explain how the code works to yourself, and to mark any questions you have or places you got stuck.

## Hello World: Your First Code

It is traditional in programming education to begin with a program that prints ``Hello World``. In Python, this is a simple task using the ``print()`` function. A function is a piece of code that performs some action—we will cover functions in more detail below. This function simply prints out whatever is inside the parentheses.

```print("Hello World")```

The code cell below has the ``print()`` function set up to get you started, so all you need to do is write the text you want to print (in this case, "Hello World") inside the quotation marks—make sure not to delete these! We'll cover why the quotation marks are needed soon.

To **execute** or **run** our code, we have a couple of options:

### Option One

Mouse over the code cell you wish to run and then push the "Play" triangle button to the left of the cell.

### Option Two

Click in the code cell you wish to run and press Ctrl + Enter (Windows) or Control + Return (OS X) on your keyboard.

In [None]:
#Fill in "Hello World!" inside of the quotation marks below and then run this block of code
print("")




You'll receive any output from running the code cell underneath the code that you ran. After you click out of the cell, a number will appear in the pair of brackets to the left of the code cell to show the order in which the cell was run. For example, assuming the code cell above is the first one you ran in this notebook, you should see a 1 in the square brackets after you run the code and then click onto another cell.

If your code is complicated or takes some time to execute, you will see an ellipsis (…) in the output line while the code executes. The "Play" button will also spin and show you a "Stop" option, which you can use to interrupt the code if it gets stuck. The first cell you run in a Colab notebook will also take a bit longer than usual, and you will probably need to tell Google that you trust the author of this notebook.

Notice that each time you run a code cell, the number increases in the pair of brackets. This keeps track of the order in which cells were run. Technically, you can run the cells in any order, but it is usually a good idea to run them sequentially from top to bottom, to avoid errors.


## Working in Google Colab
If you want to add a cell in Google Colab, you can do so with the "+ Code" and "+ Text" buttons at the top left, under the main menu bar. Click on any cell to see your options for editing it, including: cut, copy, move up, move down, and delete.

If you want to edit a text cell, double-click on it. Jupyter Notebooks use "markdown," a lightweight language for formatting text, and Google Colab will automatically show you a preview of how the markdown will display when it is run. Google Colab has some buttons that will help you fill in the markdown. There are plenty of resources online if you want to add any formatting that isn't provided by the buttons. For more on markdown, see [this guide](https://www.markdownguide.org/cheat-sheet/) or the resource from Google Colab linked below.

When you click out of a text cell in Google Colab, it will automatically go back to displaying the formatted version.

From the "Edit" menu, you can choose "Clear all outputs," if you want a clean copy of the notebook without any output from the code cells.


Try editing a text cell here by double-clicking and then filling in your name. You can also try adding some formatting, such as bold or italics to your name.

My name is:

Google Colab also has some features to help keep Notebooks organized. There is an outline button on the left that will let you see each of the sections in the Notebook. There is also a search option. Google Colab will automatically collapse and "hide" groups of multiple cells in a section, as well as very long code cells. To view these, click on the notification that the cells have been hidden.

Google has very good documentation for working in Colab. Here are a few links you might find useful:
* [Overview of Colab features](https://colab.research.google.com/notebooks/basic_features_overview.ipynb)
* [Guide to Markdown](https://colab.research.google.com/notebooks/markdown_guide.ipynb)
* [Frequently Asked Questions](https://research.google.com/colaboratory/faq.html)
* [Charts in Colab](https://colab.research.google.com/notebooks/charts.ipynb)

### What to do if you get stuck
Don't worry, you can't break anything in this notebook! If you need to, you can always make a new copy from the template folder. Google Colab also has a "playground" mode that will prevent you from making any permanent changes. Go to "Open in playground mode" under the "File" menu to access this (note that "playground" mode is an option only for Colab notebooks you have saved to your own Drive).

If you have a more serious issue with the notebook and you just want to start over, you can go to the "Runtime" menu at the top and hit "Restart runtime" (or "Factory reset runtime" for more serious issues). From this menu, you can also "Interrupt execution" for any processes you want to stop (this is the same functionality as hitting the "Stop" button).


# Python fundamentals

Python is a computer programming language that is widely used in data science and the digital humanities. We'll cover a few Python basics here, giving you the tools to understand some core concepts. If you'd like to learn more, there are many excellent resources online for learning Python, such as [Python for Everybody](https://www.py4e.com/) and the tutorials published by the [Programming Historian](https://programminghistorian.org/en/lessons/?topic=python).

**Making Mistakes is Important**

Every programmer at every skill level gets errors in their code. Making mistakes is how we all learn to program. Programming is a little like solving a puzzle where the goal is to get the desired outcome through a series of attempts. You won't solve the puzzle if you're afraid to test if the pieces match. Remember, you can always restart the Runtime in a notebook if it stops working properly or make a new copy if you misplace an important piece of code. To learn any skill, you need to be willing to play and experiment. Programming is no different.

## Expressions and Operators

One very simple form of Python programming is an **expression** using an **operator**. For example, you might have a simple mathematical statement like:

> 1 + 3

The operator in this case is `+`, sometimes called "plus" or "addition". This particular **expression** is a combination of two **values** (1 and 3) and an **operator** (`+`). In Python, expressions are combinations of values, operators, functions, and variables (more on these last two soon!).

In the code block below, try writing an expression that uses the addition operator.

In [None]:
# Type an expression in this code block, adding your street number to the year you will graduate.
# Then, run the code block.

You can also do subtraction, multiplication, and division, among other mathematical operations. To multiply in Python, you use an asterisk (\*) and to divide, you use a forward slash (/).

In [None]:
# Now try multiplication or division in this code block

You are probably not going to replace the calculator on your phone with Python! But, this example is showing you something about how Python works: here, you are creating an **expression** by combining **values** with an **operator** and running the code to produce **output**.

## Data Types (Integers, Floats, and Strings)

In the above examples, our expressions evaluated to a single numerical value. Numerical values in Python come in two basic forms:

* integer
* float (or floating-point number)

An integer, what we sometimes call a "whole number," is a number without a decimal point that can be positive or negative. When a value uses a decimal, it is called a float or floating-point number. Two numbers that are mathematically equivalent could be in two different data types. For example, mathematically 5 is equal to 5.0, yet the former is an integer while the latter is a float.
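
As a minimal sketch of how the two numeric types show up in practice (the numbers here are arbitrary examples): multiplying two integers gives an integer, while dividing with `/` gives a float, even when the result is a whole number.

```python
# Arbitrary example values, chosen only for illustration
print(7 * 6)   # 42  (integer times integer gives an integer)
print(10 / 4)  # 2.5 (division gives a float)
print(8 / 2)   # 4.0 (still a float, even though 8 divides evenly by 2)
```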

Python can also help us manipulate text. A snippet of text in Python is called a string. A string can be written with single or double quotes, but they need to match each other and they need to be the "straight" version (like these), not “curly/smart” quotes (like these).

Single quotes and double quotes do the same thing; many people use single quotes except in cases where a single quote (which is also an apostrophe) appears in the string.
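
A short sketch of why you might pick one kind of quote over the other (the sentences are made-up examples): wrapping a string in double quotes lets it contain an apostrophe, and wrapping it in single quotes lets it contain double quotes.

```python
# Double quotes on the outside, so the apostrophe inside is just a character
print("It's a lovely day")

# Single quotes on the outside, so the double quotes inside are just characters
print('She said "hello" to the class')
```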

A string can use letters, spaces, line breaks, and numbers. So 5 is an integer and 5.0 is a float, but '5' and '5.0' are strings.

|Familiar Name | Programming name | Examples |
|---|---|---|
|Whole number|integer| -3, 0, 2, 534|
|Decimal|float | 6.3, -19.23, 5.0, 0.01|
|Text|string| "Hello world", '1700 butterflies', " ", '1823'|

The distinction between each of these data types may seem unimportant, but Python treats each one differently. For example, Python considers an integer and a float equal when they represent the same number, but it never considers a string equal to an integer or a float, even when the string contains the same digits.

To evaluate whether two values are equal, we can use two equals signs between them. The expression will evaluate to either `True` or `False`.

In [None]:
# Run this code cell to determine whether the values are equal
42 == 42.0

True

In [None]:
# Run this code cell to determine whether the values are equal
42 == 42.1

False

In [None]:
# Run this code cell to compare an integer with a string
15 == 'fifteen'

False

In [None]:
# Run this code cell to compare a float with a string
15.0 == '15.0'

False

As another example of how data types behave differently: when we use the addition operator on integers or floats, they are added to create a sum. When we use the addition operator on strings, they are combined into a single, longer string. This is called [concatenation](https://docs.tdm-pilot.org/key-terms/#concatenation).

In [None]:
# Combine the strings 'Hello ' and 'World!'
'Hello ' + 'World!'

'Hello World!'

When we use the addition operator, the values must be all numbers or all strings. Mixing the two will create an error.

In [None]:
# Try adding a string to an integer
'55' + 23

TypeError: can only concatenate str (not "int") to str

Here, we receive an error because Python doesn't know how to join a string to an integer. Putting this another way, Python is unsure if we want:

>'55' + 23

to become
>'5523'

or
>78

Because these data types operate differently, it is very useful to be able to check which type you're working with. You can do this with the `type()` function. Try running the three code blocks below to check the types for 15, 15.0 and "15".

In [None]:
# Check the type for 15
type(15)

int

In [None]:
# Check the type for 15.0
type(15.0)

float

In [None]:
# Check the type for "15"
type("15")

str

## Variables
We noted above that expressions are combinations of values, operators, and variables, and said that we'd be returning to variables. A **variable** is like a container that stores information. There are many kinds of information that can be stored in a variable, including the data types we have already discussed (integers, floats, and strings). We create (or **initialize**) a variable with an **assignment statement**. The assignment statement gives the variable an initial value.

Variables are stored in your "working memory" during a coding session, which means that they are not saved to your hard drive but that they will persist during your session and will be usable from any cell in your notebook once you have initialized them. When you start a new session or after you clear a notebook, you will need to re-initialize any variables you will be using (that is, you will need to re-run the code with the assignment statements for any variables that you need).


In [1]:
# Initialize an integer variable
# Note that this code doesn't produce any output; it just establishes the variable
new_integer_variable = 6

In [2]:
# Running this code will let you see the value of your variable
new_integer_variable

6

In [None]:
# Add 22 to our new variable
new_integer_variable + 22

28

The value of a variable can be overwritten with a new value. You can test this by changing the value in the first code block above, and then re-running everything.

We can also overwrite the value of a variable using its original value. In the two cells below, we establish a variable and then add 1 to that variable. As we did above, we can then run a line of code that just has the variable name to see the value of our variable.

In [None]:
# Creating a variable "cats_in_house"
cats_in_house = 1
cats_in_house

1

In [None]:
# Adding 1 to our initial variable
cats_in_house = cats_in_house + 1
cats_in_house

2

Whenever you create a new variable, you can always confirm what data type it is with the `type()` function. For example:

In [None]:
#Checking the type of the variable cats_in_house
type(cats_in_house)

int

You can also run operations with variables that are strings:

In [None]:
# Initialize a string variable and concatenate another string
new_string_variable = 'Hello'
new_string_variable + 'World!'

'HelloWorld!'

Is that the result you were expecting? Modify the code below to see if you can improve the results. This is an important example of how coding works: the computer will do *exactly* what you tell it to, no matter how obviously "wrong" that might seem to a human.

In [None]:
# Modify this code to produce the results: "Hello World!"
new_string_variable = 'Hello'
new_string_variable + 'World!'

'HelloWorld!'

It can be difficult to keep track of which variables you've initialized, but, fortunately, there is a trick you can use. Running ```%whos``` will give you the basic details for the variables that are active in your current session.

In [None]:
%whos

Variable               Type        Data/Info
--------------------------------------------
cats_in_house          int         2
new_integer_variable   int         6
new_string_variable    str         Hello
remove_stopwords       function    <function remove_stopwords at 0x7d2787494ca0>


You can also view variables stored in your working memory by clicking the {*x*} on the left side of the notebook.

### Variable naming guidelines
You can create a variable with almost any name, but there are a few guidelines that are recommended. First, variable names should be clear and descriptive.

For example, if we create a variable that stores the day of the month, it is helpful to give it a name that makes the value stored inside it obvious, something like `day_of_month`. From the computer's perspective, we could call the variable almost anything (`potato`, `bananafish`, `flat_tire`). As long as we are consistent, the code will execute the same. When it comes time to read, modify, and understand the code, however, it will be confusing to you and others. Consider this simple program that lets us change the `hours_per_week` variable to compute the full-semester pay for an employee.

In [None]:
# Compute the semesterly wages for an employee
hours_per_week = 20
rate = 24
weeks_per_semester = 14

hours_per_week * rate * weeks_per_semester

6720

We could write a program that is logically the same, but uses confusing variable names.

In [None]:
hotdogs = 20
sasquatch = 24
example = 14

hotdogs * sasquatch * example

6720

This code gives us the same answer as the first example, but it is confusing. Not only does this code use variable names that make no sense, it also does not include any comments to explain what the code does. It is not clear that we would change `hotdogs` to set a different number of hours per week. It is not even clear what the purpose of the code **is**. As code gets longer and more complex, having clear variable names and explanatory comments is very important.

To recap: variable names should be clear, brief, and descriptive, so that you and everyone else who uses your code can easily remember them and recognize what they are meant to represent.

### Variable naming rules

In addition to being descriptive, variable names must follow three basic rules:

1. Must be one word (no spaces allowed)
2. Only letters, numbers and the underscore character (\_) are allowed
3. Cannot begin with a number

Additionally, there are some "reserved words" in Python that you are not allowed to use for the names of variables (or for any other identifiers that you choose). These words are "reserved" because they are already used in the actual Python code. You can see a list of these words [here](https://www.w3schools.com/python/python_ref_keywords.asp). You should also be careful never to use Python function names (like `print`) as your variable names.
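
If you are unsure whether a name is reserved, one option is to ask Python directly. The sketch below uses the standard-library `keyword` module (importing modules is covered later in this notebook); the names being checked are only examples.

```python
import keyword

# iskeyword() returns True for reserved words and False otherwise
print(keyword.iskeyword("class"))         # True
print(keyword.iskeyword("day_of_month"))  # False

# kwlist holds every reserved word for the Python version you are running
print(keyword.kwlist)
```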

Finally, it's important to note that Python is case sensitive: ```new_integer``` and ```New_Integer``` are two completely different variables.
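
A minimal sketch of case sensitivity, using the two names from the sentence above with arbitrary values: each name holds its own value, and changing one does not affect the other.

```python
# Two different variables whose names differ only in capitalization
new_integer = 1
New_Integer = 2

print(new_integer)                 # 1
print(New_Integer)                 # 2
print(new_integer == New_Integer)  # False
```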

In [None]:
# Which of these variable names are acceptable?
# "Comment out" the variables that are not allowed in Python by putting a # before each line with an invalid variable name
# Then run this cell to check if the variable assignment works.
# If you get an error, the variable name is not allowed in Python.

$variable = 1
#a variable = 2
a_variable = 3
#4variable = 4
variable5 = 5
variable6 = 6
variAble = 7
Avariable = 8

SyntaxError: cannot assign to expression here. Maybe you meant '==' instead of '='? (<ipython-input-3-ba3d5e39a010>, line 11)

## Functions

Many different kinds of programs often need to do very similar operations. Instead of writing the same code over and over again, you can use a **function**. Essentially, a function is a small snippet of code that can be quickly referenced and reused, and that does some specific task.

There are three kinds of functions:
* Native functions built into Python
* Functions others have written that you can import
* Functions you write yourself

We have already used a couple of functions, `type()` and `print()`:

In [None]:
type("Hello World!")

str

In [None]:
print('Hello World!')

Hello World!


The above example just prints a string. We could also define a variable with our chosen input string and then *pass* that variable into the `print()` function. It is common for functions to take an input, called an argument, that is placed inside the parentheses.

In [None]:
# Define a string and then print it
our_string = 'This is a string!'
print(our_string)

This is a string!


Let's take a look at a few more functions, so you can get some practice using and modifying them. For example, the `len()` function returns the number of items in its argument. When you use `len()` with a string, it will tell you how many characters are in the string:

In [None]:
len(our_string)

17

In the code block below, try overwriting the `our_string` variable with a different string, then use `len()` to find its length:

In [None]:
# Add code here to overwrite our_string and then find its length with len()


Another useful function is the `input()` function for taking user input. When this function is called, the user is presented with a box for entering a response; after the user hits Enter or Return, the program continues. The text entered by the user is stored as a string.

In [None]:
# A program to greet the user by name
print('Hi. What is your name?') # Ask the user for their name
user_name = input() # Take the user's input and put it into the variable user_name
print('Pleased to meet you, ' + user_name) # Print a greeting with the user's name

Hi. What is your name?

Pleased to meet you, 


We defined a string ```user_name``` to hold the user's input. We then called the `print()` function to print the concatenation of 'Pleased to meet you, ' and the user's input that was captured in the variable ```user_name```. Remember that we can use a ```+``` to **concatenate** or join these strings together.

Let's look at another set of useful functions, ones that you can use to transform variables from one data type to another.

Remember how important data types are! We can concatenate any number of strings together, but we cannot add a string to an integer.

In [None]:
# Trying to concatenate a string with an integer causes an error
print('There are ' + 7 + 'continents.')

TypeError: can only concatenate str (not "int") to str

However, there are a few functions that can help with this. We can transform one variable type into another variable type with the `str()`, `int()`, and `float()` functions. Let's convert the integer above into a string so we can concatenate it.

In [None]:
print('There are ' + str(7) + ' continents.')

There are 7 continents.
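
The same conversion functions resolve the ambiguity in the earlier `'55' + 23` example. As a small sketch: converting the string to an integer tells Python to add, while converting the integer to a string tells Python to concatenate.

```python
# Treat '55' as a number: convert the string to an integer, then add
print(int('55') + 23)    # 78

# Treat 23 as text: convert the integer to a string, then concatenate
print('55' + str(23))    # 5523

# float() works the same way for decimal numbers stored as strings
print(float('5.0') + 1)  # 6.0
```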


# Working with Strings in Python

Python's ability to do powerful things with text is one of its major advantages. We don't have time to go into a full-fledged text analysis tutorial here, but we can show a few small examples of ways you can use Python to manipulate strings. Here, we'll introduce **string methods**, which are like functions that work specifically on strings.

To see how these string methods work, we'll use a few sentences from E. L. Konigsburg's novel *From the Mixed-Up Files of Mrs. Basil E. Frankweiler*:

“More than a quarter of a million people come to that museum every week. They come from Mankato, Kansas, where they have no museums and from Paris, France, where they have lots. And they all enter free of charge because that’s what the museum is: great and large and wonderful and free to all. And complicated.”

First, we'll initialize a variable called `sample_text` whose value is a string with the quotation above:

In [None]:
sample_text = "More than a quarter of a million people come to that museum every week. They come from Mankato, Kansas, where they have no museums and from Paris, France, where they have lots. And they all enter free of charge because that’s what the museum is: great and large and wonderful and free to all. And complicated."

In the code block below, use the `print()` function to print our new string variable.

In [None]:
# Print the `sample_text` variable here
print(sample_text)


More than a quarter of a million people come to that museum every week. They come from Mankato, Kansas, where they have no museums and from Paris, France, where they have lots. And they all enter free of charge because that’s what the museum is: great and large and wonderful and free to all. And complicated.


Now, let's try out a few string methods. For example, we can use the `lower()` string method to convert text into all lowercase and then use an assignment statement to make a new variable that is a lowercased version of `sample_text`. Don't worry too much about the syntax for the string method—just focus on what the output looks like.

In [None]:
# Lowercasing our variable with `lower()`
sample_text_lowercased = sample_text.lower()
print(sample_text_lowercased)

more than a quarter of a million people come to that museum every week. they come from mankato, kansas, where they have no museums and from paris, france, where they have lots. and they all enter free of charge because that’s what the museum is: great and large and wonderful and free to all. and complicated.


Note that we haven't changed the value of our original variable, but have instead created a new one. You can confirm this by running the code cell below to print `sample_text` again. An important part of working with Python is making sure you keep track of the changes you have made to your variables! This is where `%whos` can be very useful.

In [None]:
print(sample_text)

More than a quarter of a million people come to that museum every week. They come from Mankato, Kansas, where they have no museums and from Paris, France, where they have lots. And they all enter free of charge because that’s what the museum is: great and large and wonderful and free to all. And complicated.


We can also create an uppercased version of our string (perhaps we want the text to be more emphatic!).

In [None]:
# Uppercasing our variable with `upper()`
sample_text_uppercased = sample_text.upper()
print(sample_text_uppercased)

MORE THAN A QUARTER OF A MILLION PEOPLE COME TO THAT MUSEUM EVERY WEEK. THEY COME FROM MANKATO, KANSAS, WHERE THEY HAVE NO MUSEUMS AND FROM PARIS, FRANCE, WHERE THEY HAVE LOTS. AND THEY ALL ENTER FREE OF CHARGE BECAUSE THAT’S WHAT THE MUSEUM IS: GREAT AND LARGE AND WONDERFUL AND FREE TO ALL. AND COMPLICATED.


We can even swap the cases, though it's harder to think of a research application for this one.

In [None]:
# Swapping the cases in our variable with `swapcase()`
sample_text_case_swapped = sample_text.swapcase()
print(sample_text_case_swapped)

mORE THAN A QUARTER OF A MILLION PEOPLE COME TO THAT MUSEUM EVERY WEEK. tHEY COME FROM mANKATO, kANSAS, WHERE THEY HAVE NO MUSEUMS AND FROM pARIS, fRANCE, WHERE THEY HAVE LOTS. aND THEY ALL ENTER FREE OF CHARGE BECAUSE THAT’S WHAT THE MUSEUM IS: GREAT AND LARGE AND WONDERFUL AND FREE TO ALL. aND COMPLICATED.


And, there is a `count()` method that will return the number of times a specified value appears in the string. Let's see how many times "and" is used in our sample text.

In [None]:
# Counting the number of times a value appears in our variable with `count()`
counted_text = sample_text.count("and")

print(counted_text)

4


Scroll back up to the sample and count the occurrences of "and". Is the result above what you would expect? What happened here?

Let's try again, but with the lowercased string variable:

In [None]:
counted_text = sample_text_lowercased.count("and")

print(counted_text)

6
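
The two counts differ because `count()` is case-sensitive: "And" and "and" are different strings. It also counts matches inside longer words. The sketch below uses a made-up sentence (the variable name `phrase` is only for illustration) to show both behaviors.

```python
# A made-up example sentence
phrase = "And then the band played and sang"

# Finds "and" inside "band" and the standalone "and", but not the capitalized "And"
print(phrase.count("and"))          # 2

# After lowercasing, the opening "And" is counted too
print(phrase.lower().count("and"))  # 3
```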


This last example brings together a few things we've seen already:


*   First, we use the `print()` function to print a string
*   Then, we define the `quote` variable, whose value is user-provided input from the `input()` function
*   Then, we define and print another variable called `quote_length` whose value is the outcome of running the `len()` function on `quote`
*   Then, we print another string
*   And finally we print the output from using the `count()` method to count the number of times a specified character ("e" in this case) appears in the `quote` string.

Note that we have printed several lines here, but only defined two new variables. Sometimes, you'll just want to run code and get some output; other times you'll want to store the results of your code in a variable. The main difference is whether or not you want to be able to use that output again.

In [None]:
print("What is your favorite quotation?")
quote = input()
print("Let's count the characters in that quotation:")
quote_length = len(quote)
print(quote_length)
print("And, how many of those characters are the letter e?")
print(quote.count("e"))

What is your favorite quotation?

Let's count the characters in that quotation:
0
And, how many of those characters are the letter e?
0


There are many more things you can do with Python, but hopefully this gave you a sense of the possibilities, as well as some of the ways you need to think when you work with programming languages like Python. In this lesson, we've covered several key concepts for working in Python: expressions, operators, data types, variables, and functions.

# Exploring a Dataframe

### Libraries and Modules
So far, we've focused mostly on using built-in functions. Now, let's talk about importing others' functions and writing them ourselves.

While Python comes with many functions, there are thousands more that others have written. Adding them all to Python would create mass confusion, since many people could use the same name for functions that do different things. The solution then is that functions are stored in [modules](https://constellate.org/docs/key-terms/#module) that can be **imported** for use. A module is a Python file (extension ".py") that contains the definitions for the functions written in Python. These modules can then be collected into even larger groups called [packages](https://constellate.org/docs/key-terms/#package) and [libraries](https://constellate.org/docs/key-terms/#library). Depending on how many functions you need for the program you are writing, you may import a single module, a package of modules, or a whole library.

The general form of importing a module is:
`import module_name`

To access one of the functions in the module, you have to specify the name of the module and the name of the function, separated by a dot (also known as a period). This format is called **dot notation**.
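
As a minimal sketch of dot notation, the example below imports Python's standard-library `math` module (chosen only for illustration; it is not one of the modules this notebook relies on) and calls two of its functions.

```python
import math

# module_name.function_name(argument)
print(math.sqrt(16))    # 4.0
print(math.floor(3.7))  # 3
```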


In this section we will use several imported libraries. One of those is the [`pandas`](https://constellate-org.ezproxy.neu.edu/docs/key-terms/#pandas/) library. The `pandas` library includes functions that let you visualize and analyze data in Python. Dataframes are the main structures used to view and manipulate data within the `pandas` library.

In [None]:
#Installing Packages
!pip install contractions
!pip install demoji

#downloading NLTK tools
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk import word_tokenize

#Importing other python packages
import re
import contractions
import string
import demoji
import pandas as pd
import numpy as np
#Googly stuff
from google.colab import files


Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 289.9/289.9 kB 4.6 MB/s eta 0:00:00
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 110.7/110.7 kB 6.4 MB/s eta 0:00:00
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.1.0 textsearch-0.0.24
Collecting demoji
  Downloading demoji-1.1.0-p

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Now that we have all of the packages we need installed, we can download our data.
Here, we:
- Initialize a variable named url that serves as the link to our dataset.
- Reformat and split the url
- Pass the reformatted url to the read_csv function from pandas.

It is important to note that pandas has built-in functions to read in a variety of data formats, including Excel spreadsheets, [JSON](https://constellate-org.ezproxy.neu.edu/docs/key-terms/#json), and text files.

In [None]:
#reading in a file
url='https://drive.google.com/file/d/1cD7HhPgSeo-VRYtXgq7QhvjHPG804y-2/view?usp=sharing'
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
data = pd.read_csv(url)
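
To see what the "reformat and split" step does, the sketch below runs the same string operations one at a time, without downloading anything. The variable names `parts`, `file_id`, and `direct_url` are introduced here only for illustration.

```python
# The sharing link from the cell above
url = 'https://drive.google.com/file/d/1cD7HhPgSeo-VRYtXgq7QhvjHPG804y-2/view?usp=sharing'

# Splitting on '/' gives a list of the pieces of the link
parts = url.split('/')

# The second-to-last piece is the file ID
file_id = parts[-2]
print(file_id)  # 1cD7HhPgSeo-VRYtXgq7QhvjHPG804y-2

# Attach the ID to the address that is passed to read_csv
direct_url = 'https://drive.google.com/uc?id=' + file_id
print(direct_url)
```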

It is good practice to examine the shape of your data once it is loaded as a dataframe. This can help to identify errors, and if the data is unfamiliar to you, it is a necessary first step in examining it. Here, we use the ``.shape`` attribute to examine the shape of our ``data`` dataframe. The first value represents the number of rows, while the second value represents the number of columns. Here, we observe that ``data`` has 1,000 rows and 6 columns.

In [None]:
#exploring the shape of the dataframe
data.shape

(1000, 6)
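
If you would like to try this without the workshop data, here is a minimal sketch on a small made-up dataframe. The name `toy` and its contents are invented for illustration.

```python
import pandas as pd

# A small made-up dataframe (not the workshop data)
toy = pd.DataFrame({
    'tweet_text': ['Sloth selfie!', 'Slofie time', 'So funny'],
    'like_count': [1, 0, 28],
})

# shape is an attribute, so there are no parentheses after it
rows, columns = toy.shape
print(rows, columns)  # 3 2
```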

Now that we have an understanding of the ``dimensionality`` of the data (its height and width), we can take a peek at the first N rows of the data using the ``.head(N)`` method and see what the column values look like. If we leave N out, ``.head()`` returns the first five rows.

In [None]:
#exploring head of dataframe
data.head()

Unnamed: 0.1,Unnamed: 0,tweet_text,harm_category,possibly_sensitive,tweet_language,like_count
0,0,Obligatory baby sloth selfie. 🍼🦥 Snooze is 11 ...,a,False,en,0
1,1,Sloth Selfie,a,False,en,1
2,2,A sloth selfie! 🦥🤳 Happy #NationalSelfieDay!,u,False,en,0
3,3,Sloth Selfie A.K.A Slofie :) Photo By Nicolas ...,a,False,en,0
4,4,Sloth selfie!,a,False,en,1


In [None]:
data.head(10)

Unnamed: 0.1,Unnamed: 0,tweet_text,harm_category,possibly_sensitive,tweet_language,like_count
0,0,Obligatory baby sloth selfie. 🍼🦥 Snooze is 11 ...,a,False,en,0
1,1,Sloth Selfie,a,False,en,1
2,2,A sloth selfie! 🦥🤳 Happy #NationalSelfieDay!,u,False,en,0
3,3,Sloth Selfie A.K.A Slofie :) Photo By Nicolas ...,a,False,en,0
4,4,Sloth selfie!,a,False,en,1
5,5,"When trekking through the Amazon rainforest, y...",a,False,en,28
6,6,That is great! Not quite as great as this al...,a,False,en,1
7,7,A sloth selfie! 🦥🤳 Happy #NationalSelfieDay!,u,False,en,0
8,8,A sloth selfie! 🦥🤳 Happy #NationalSelfieDay!,u,False,en,0
9,9,A sloth selfie! 🦥🤳 Happy #NationalSelfieDay!,u,False,en,0


We can also take a peek at the last N rows of the data using the ``.tail(N)`` method and see what the column values look like.

In [None]:
#exploring tail of dataframe
data.tail(2)
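
The sketch below shows ``.head()`` and ``.tail()`` side by side on a made-up dataframe whose rows are simply numbered, so it is easy to see which rows come back. The names `toy`, `first_rows`, and `last_rows` are invented for this example.

```python
import pandas as pd

# Made-up dataframe with ten rows, numbered 0 through 9
toy = pd.DataFrame({'row_number': range(10)})

first_rows = toy.head()   # no argument: the first 5 rows
last_rows = toy.tail(2)   # the last 2 rows

print(list(first_rows['row_number']))  # [0, 1, 2, 3, 4]
print(list(last_rows['row_number']))   # [8, 9]
```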

Because it is important to understand the data type of each column in our ``data`` dataframe, we can examine these data types by using the ``dtypes`` attribute.

In [None]:
#examining data types
data.dtypes

Unnamed: 0             int64
tweet_text            object
harm_category         object
possibly_sensitive      bool
tweet_language        object
like_count             int64
dtype: object

Exploring the distribution of values within a column is another important task. Here, we call the ``value_counts()`` method on the ``harm_category`` column of our ``data`` dataframe. We can examine other columns by changing the name of the column that is located between the single quotes inside the brackets.

In [None]:
#using value counts to examine a column
data['harm_category'].value_counts()

a     485
e      95
u      29
b      15
aa      1
Name: harm_category, dtype: int64
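
One thing to keep in mind is that ``value_counts()`` leaves out missing values by default, which may explain why the counts above add up to 625 rather than 1,000. The sketch below, on a made-up column with one missing value, shows the default behavior next to the ``dropna=False`` option.

```python
import pandas as pd

# Made-up labels, including one missing value (None)
toy = pd.DataFrame({'harm_category': ['a', 'a', 'e', None]})

# By default, missing values are not counted
counts = toy['harm_category'].value_counts()
print(counts.sum())  # 3

# dropna=False adds a row for the missing values
counts_with_missing = toy['harm_category'].value_counts(dropna=False)
print(counts_with_missing.sum())  # 4
```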

Sometimes we might want to select a single column to see what the values look like row by row. We can select a single column by putting that column name inside single quotes in brackets beside the dataframe name.

In [None]:
#selecting a single column
data['harm_category']

0      a
1      a
2      u
3      a
4      a
      ..
995    a
996    a
997    a
998    a
999    a
Name: harm_category, Length: 1000, dtype: object

To select multiple columns, we separate the individual column names, each wrapped in quotes, with commas and add a second set of brackets. The inner brackets create a ``list`` of column names, and the outer brackets tell the dataframe to select those columns.

In [None]:
#selecting multiple columns
data[['harm_category','like_count']].head(2)

Unnamed: 0,harm_category,like_count
0,a,0
1,a,1
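
The difference between one and two sets of brackets also shows up in the kind of object you get back. This is a minimal sketch on a made-up dataframe; the names `toy`, `one_column`, and `two_columns` are invented here.

```python
import pandas as pd

toy = pd.DataFrame({
    'harm_category': ['a', 'u'],
    'like_count': [0, 1],
    'tweet_language': ['en', 'en'],
})

# One set of brackets with one column name returns a Series
one_column = toy['harm_category']

# Two sets of brackets (a list of names) returns a smaller dataframe
two_columns = toy[['harm_category', 'like_count']]

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
print(list(two_columns.columns))   # ['harm_category', 'like_count']
```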


Understanding the textual content of data is crucial. The presence of particular keywords can be a powerful signal of a certain kind of content. It is important to use your intuition as a researcher and analyst to determine the kinds of words or phrases that may help a model differentiate between classes. Here, we use the ``.loc`` indexer with the ``.str.contains()`` method on the ``tweet_text`` column to search for any rows that contain the word ``slofie`` - a combination of sloth and selfie. Notice the ``case`` parameter that allows us to determine whether we want the search to be case sensitive.

In [None]:
#String contains code to search single words
data.loc[data['tweet_text'].str.contains('slofie',case=False)]

We might also want to search for multiple words that represent the same category. Here, we use the ``.loc`` indexer with the ``.str.contains()`` method on the ``tweet_text`` column to search for any rows that contain the word ``slofie`` or ``slofies`` because for our purposes, these terms are very similar. To move from searching for a single word or phrase to multiple words or phrases, we use the same syntax as a single-word search, but separate the search terms with a vertical bar inside one pair of single quotation marks. The vertical bar essentially acts as an ``or`` operator.

In [None]:
#using string contains to search multiple words
data.loc[data['tweet_text'].str.contains('slofie|slofies',case=False)]

Unnamed: 0.1,Unnamed: 0,tweet_text,harm_category,possibly_sensitive,tweet_language,like_count
3,3,Sloth Selfie A.K.A Slofie :) Photo By Nicolas ...,a,False,en,0
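
The searches above can be tried on a small made-up dataframe, as in the sketch below. The tweets are invented, and one of them is deliberately missing to show the ``na=False`` option, which treats rows with no text as non-matches so that they do not cause trouble when the result is used to filter rows.

```python
import pandas as pd

toy = pd.DataFrame({'tweet_text': [
    'Sloth Selfie A.K.A Slofie :)',
    'Two slofies today',
    'Sloth selfie!',
    None,  # a row with no text
]})

# True/False for each row; the vertical bar means "or"
mask = toy['tweet_text'].str.contains('slofie|slofies', case=False, na=False)
print(list(mask))  # [True, True, False, False]

# Keep only the rows where the mask is True
matches = toy.loc[mask]
print(list(matches.index))  # [0, 1]
```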


To store the results, we can assign the subset of the ``data`` dataframe we selected to a new variable. This can be useful for examining subsets of data, and works with single- or multiple-word ``.loc`` searches. Here, we create a variable ``slofie_slofies`` that holds a dataframe containing all the rows of our data that contain the terms ``slofie`` or ``slofies``.

In [None]:
#using string contains to search multiple words and store the result in a new variable
slofie_slofies = data.loc[data['tweet_text'].str.contains('slofie|slofies',case=False)]

Another common task for analysts and researchers is to create a new column based on the presence of a value in an existing column. This allows us to do things like one-hot-encode data for machine learning, or analyze the counts of certain terms or phrases in our data. Here, we initialize a new column ``is_funny`` and set it equal to zero. Next, we use a ``.loc`` and ``.str.contains()`` search to locate the rows that contain the word ``funny``. If you compare this search with the one above, you may notice we name the new column by including ``'is_funny'`` after a comma, before the closing bracket, and add an ``= 1`` after the bracket. This tells the program to overwrite the column value of ``0`` in ``is_funny`` with a ``1`` if the row contains that word.

In [None]:
#using loc to make new columns
data['is_funny']=0
data.loc[data['tweet_text'].str.contains('funny',case=False), 'is_funny'] = 1
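
Here is the same pattern on a made-up dataframe, so you can check the result row by row. The tweets are invented for this sketch.

```python
import pandas as pd

toy = pd.DataFrame({'tweet_text': ['So FUNNY', 'Sloth selfie!', 'funny sloth']})

# Start every row at 0, then overwrite with 1 where the text matches
toy['is_funny'] = 0
toy.loc[toy['tweet_text'].str.contains('funny', case=False), 'is_funny'] = 1
print(list(toy['is_funny']))  # [1, 0, 1]

# Adding up the column tells us how many rows were flagged
print(toy['is_funny'].sum())  # 2
```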

We might find ourselves in situations where we would like to have a copy of the labeled data on our local machine. This could be to provide the data to an external stakeholder or to analyze the data using other software. The code below uses the ``to_csv`` method to write our dataframe to a CSV file named ``our_data.csv``. Then, we use the ``download()`` function from the ``files`` module of ``google.colab`` to download the newly created file.

In [None]:
#exporting a file
data.to_csv('our_data.csv')
files.download('our_data.csv')
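
By default, ``to_csv`` also writes the row numbers (the index) as an extra column with no name. When such a file is read back in, pandas labels that column ``Unnamed: 0``, which is one likely source of the ``Unnamed: 0`` columns seen earlier in this notebook. The sketch below compares the default with ``index=False`` on a made-up dataframe; it keeps the CSV text in memory instead of writing a file.

```python
import io
import pandas as pd

toy = pd.DataFrame({'tweet_text': ['Sloth selfie!'], 'like_count': [1]})

# With no file name, to_csv returns the CSV text instead of writing a file
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

# Reading the text back in shows the difference
columns_with_index = list(pd.read_csv(io.StringIO(with_index)).columns)
columns_without_index = list(pd.read_csv(io.StringIO(without_index)).columns)

print(columns_with_index)     # ['Unnamed: 0', 'tweet_text', 'like_count']
print(columns_without_index)  # ['tweet_text', 'like_count']
```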


# Python Tools Used in Workshop Explained



In [None]:
#Installing Packages
!pip install contractions
!pip install demoji

#downloading NLTK tools
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk import word_tokenize

#Importing other python packages
import re
import contractions
import string
import demoji
import pandas as pd
import numpy as np
import sklearn as sk

#Importing SKlearn modules
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

#Googly stuff
from google.colab import files




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# User-Defined Functions for Text Processing

Fixing contractions is one important step.
This function:
- Takes the input (here, a column of tweets) and iterates over each item, making it a string
- Copies the strings to a new list
- Expands any contractions in each string to their full form
- Returns the expanded strings as a list


In [None]:
def fix_contractions(text):
    text = [str(i) for i in text]
    expanded = [i for i in text]
    no_contractions = [contractions.fix(i) for i in expanded]
    return no_contractions

This function uses the ``re`` package to remove usernames and links from the data. It works by:
- Replacing usernames (an ``@`` followed by any run of non-space characters) with an empty string
- Replacing any token that starts with ``http`` with an empty string
- Replacing the retweet indicator ``RT`` with an empty string

In [None]:
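def remove_usernames_links(text):
    #raw strings (r'...') keep the backslash in \s from being read as a string escape
    text = re.sub(r'@[^\s]+','',text)     #remove @usernames
    text = re.sub(r'http[^\s]+','',text)  #remove links
    text = re.sub('RT','',text)           #remove the retweet indicator
    return text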
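
To see what each substitution does, the sketch below runs the same three patterns on a made-up tweet. The sample text and the name `strip_usernames_links` are invented for this example. Note that the last pattern removes the letters ``RT`` wherever they appear, including inside a longer word.

```python
import re

def strip_usernames_links(text):
    text = re.sub(r'@[^\s]+', '', text)     # remove @usernames
    text = re.sub(r'http[^\s]+', '', text)  # remove links
    text = re.sub('RT', '', text)           # remove the letters RT
    return text

sample = 'RT @slothfan: Best slofie ever https://example.com/pic'
cleaned = strip_usernames_links(sample)

# The removed pieces leave their surrounding spaces behind
print(repr(cleaned))          # '  Best slofie ever '
print(repr(cleaned.strip()))  # 'Best slofie ever'

# RT is removed even when it is part of another word
print(strip_usernames_links('SMART sloth'))  # SMA sloth
```

If the leftover spaces matter for your analysis, calling ``.strip()`` on the result removes them from the start and end of the string.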

This function uses the ``demoji`` package to replace emojis in the text with an empty string and returns the cleaned string.


In [None]:
def remove_emoji(text):
    result = demoji.replace(text, "")
    return result

This function uses a list comprehension to keep the characters of the text string that are not included in ``string.punctuation`` (a string of common punctuation marks), then joins them back into a single string.

In [None]:
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree
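
The sketch below shows what ``string.punctuation`` contains and what the function returns for one made-up tweet. It repeats the function so that the example can run on its own.

```python
import string

# string.punctuation is an ordinary string of punctuation characters
print(string.punctuation)

def remove_punctuation(text):
    # keep each character that is not a punctuation mark, then rejoin
    return "".join([ch for ch in text if ch not in string.punctuation])

result = remove_punctuation('A sloth selfie! Happy #NationalSelfieDay!')
print(result)  # A sloth selfie Happy NationalSelfieDay
```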

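This function uses a list comprehension to lemmatize each token in a list of tokens with the ``lemmatize()`` method of a ``lemmatizer`` object, and returns the lemmatized tokens as a list. The ``lemmatizer`` object must be created before the function is called.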

In [None]:
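#assumption: lemmatizer is an NLTK WordNetLemmatizer (needs nltk.download('wordnet'))
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
def lemmatize_text(text):
    output=[lemmatizer.lemmatize(i) for i in text]
    return output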

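This function uses a list comprehension to keep the tokens that are not included in ``stopwords``. For this to work, the ``stopwords`` variable must hold the list of English stop words (for example, the result of ``stopwords.words('english')``) rather than the imported ``stopwords`` corpus object.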

In [None]:
def remove_stopwords(text):
    output= [i for i in text if i not in stopwords]
    return output
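
As a minimal sketch of the filtering step, the example below uses a tiny hand-written set of stop words in place of NLTK's English list, so it runs without any downloads. The name `stop_words` and its contents are assumptions made for this illustration.

```python
# A tiny hand-written stop word set, standing in for NLTK's English list
stop_words = {'the', 'is', 'a', 'of'}

def remove_stopwords(tokens):
    # keep only the tokens that are not stop words
    return [t for t in tokens if t not in stop_words]

tokens = ['the', 'sloth', 'is', 'a', 'master', 'of', 'selfies']
filtered = remove_stopwords(tokens)
print(filtered)  # ['sloth', 'master', 'selfies']
```

Because the comparison is exact, tokens are usually set to lowercase before this step; ``'The'`` would not match ``'the'``.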

Here, we bring it all together by creating a function to preprocess our data.
The function:
- Expands contractions using the ``contractions`` library
- Removes usernames and links from the text string
- Removes emojis from the text string
- Removes punctuation from the text string
- Sets the characters to lowercase
- Tokenizes the text string
- Removes stopwords from the tokens
- Lemmatizes each token
- Joins the tokens back together to form a sentence
- Returns our modified dataframe

In [None]:
def preprocess(data):
    #each step reads the column created by the step before it
    data['no_contractions'] = fix_contractions(data['tweet_text'])
    data['no_usernames_links'] = [remove_usernames_links(i) for i in data['no_contractions']]
    data['no_emoji'] = [remove_emoji(i) for i in data['no_usernames_links']]
    data['no_punctuation'] = [remove_punctuation(i) for i in data['no_emoji']]
    data['tweet_lower'] = data['no_punctuation'].str.lower()
    data['word_tokens'] = [word_tokenize(i) for i in data['tweet_lower']]
    data['no_stopwords'] = data['word_tokens'].apply(remove_stopwords)
    data['lemmatized'] = [lemmatize_text(i) for i in data['no_stopwords']]
    #joined_tokens is built from the no_stopwords column
    data['joined_tokens'] = data['no_stopwords'].str.join(" ")
    return data