# Introduction to Python: Dataframes

#### Working with Text Metadata and Full Texts in a Python Dataframe

**A Reproducible Research Workshop**

(A Collaboration between Dartmouth Library and Research Computing)

[*Click here to view or register for our current list of workshops*](http://dartgo.org/RRADworkshops)

*This notebook created by*:
+ Version 1.0: Jeremy Mikecz, Research Data Services (Dartmouth Library) drawing on some existing tutorials (cited below).
+ Version 2.0: ???
<!--
+ Some of the inspiration for the code and information in this notebook was taken from https://www.w3schools.com/python/python_intro.asp -- This is a great resource if you want to learn more about Python!-->

This is **Notebook 3** of 3 for the **Introduction to Text Analysis in Python** workshop:
+ Notebook 1: The Basics - getting started with Python
+ Notebook 2: Working with Texts (and other data) - importing, reviewing, and modifying texts and other data
+ **Notebook 3: Dataframes - importing texts and other data, placing this data into a dataframe, and then modifying, analyzing, visualizing, and exporting this data**

**Table of Contents**

1. Create a dataframe from scratch [creating a dataframe - option 1]
2. Read a dataframe from a csv [creating a dataframe - option 2]
3. Working with a dataframe:
    + get summary data from a dataframe
    + access parts of a dataframe
    + subset, filter, and sort dataframe
    + modify a dataframe
    + split-apply-combine
4. data visualization - brief introduction to seaborn & matplotlib
5. iterate through text files and create a dataframe [creating a dataframe - option 3]

## Python and Dataframes

Nearly all researchers work with tabular data at one time or another. In this lesson, we will practice with **dataframes**, a Python data structure designed to work with tabular data, using the **Pandas library**. We will create our own dataframes, read in data from .csv files to a dataframe, subset and combine dataframes, and add or modify columns (variables) and observations (rows). We will also examine how these tasks work with messy and large datasets (e.g. millions of rows).

**Data tables, like Python dataframes, are often associated with quantitative or categorical data. However, text analysts often find dataframes useful for storing metadata from texts as well as full texts themselves and word lists for each document.**

To learn more about Pandas, visit the webpage for the [Pandas library](https://pandas.pydata.org/).

For further practice with Python and Pandas dataframes, visit:

1. The Software Carpentry [Pandas DataFrames](http://swcarpentry.github.io/python-novice-novels/08-data-frames/index.html) lesson.
2. The **Pandas 1**, **Pandas 2**, and **Pandas 3** [tutorials offered by Constellate](https://constellate.org/tutorials).
3. Chapter 3, ["Data Analysis (Pandas)"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/03-Data-Analysis/01-Pandas-Basics-Part1.html) in Melanie Walsh's *Introduction to Cultural Analytics & Python*.

Note: All three platforms above offer excellent lessons for learning how to perform various other tasks in Python. Check them out.

Finally, don't worry about memorizing all the functions and syntax needed to work with Pandas. You can use the [Python Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) as a handy guide to remind you of some of these key functions.


<h2 style="text-align:center;font-size:300%;">Dataframes: The Basics</h2> 
  <img src="https://media.geeksforgeeks.org/wp-content/cdn-uploads/creating_dataframe1.png" style="width:40%;">


A dataframe is a particular data structure that includes:
+ **rows** that record each observation of data
+ **columns** that record different attributes / measures / variables for each observation
+ **column labels** that describe the data recorded in each column
+ **index labels** that differentiate between observations

## I. Create a dataframe

1. First, we need to import the [**Pandas**, Python's data analysis library](https://pandas.pydata.org/) that allows us to work with dataframes. It is almost universal practice among Python users to import Pandas under the name "pd" to serve as a useful abbreviation we can use when calling Pandas functions. 

2. There are many ways to create a dataframe:
    +  Perhaps the most common way is to **load an existing dataset saved as a *.csv*** (Comma Separated Values) file and read it in directly as a Pandas dataframe. We will introduce that method in the next section.
    +  **Import data from other files and convert it into a dataframe.** For example, you could write a Python script that iterates over thousands of text files and creates a Pandas dataframe with basic information about each text (e.g. file names, number of words, etc.)
    + **We can also create a dataframe from scratch** (usually only used to demo or practice with dataframes):

**In the example below, we will create a dataframe from scratch** using data showing outcomes of the men's World Cup in recent decades.

In the previous notebook, we worked extensively with lists, which offer a simple way to store a one-dimensional series of data. Dataframes, however, allow us to store two-dimensional data, with multiple attributes (columns) for each observation (row). We can thus think of a dataframe as a list of lists.

In [None]:
import pandas as pd   # "pd" is the conventional abbreviation for Pandas

#creating a dataframe from a dictionary of lists (one list per column)
# this example comes from Constellate's Pandas 1 tutorial
  ## https://lab.constellate.org/ilr-review-primary/notebooks/tdm-notebooks-2023-04-19T13%3A37%3A13.281Z/pandas-1.ipynb
wcup = pd.DataFrame({"Year": [2022, 
                              2018, 
                              2014, 
                              2010, 
                              2006, 
                              2002, 
                              1998, 
                              1994, 
                              1990,
                              1986], 
                     "Champion": ["Argentina", 
                                  "France", 
                                  "Germany", 
                                  "Spain", 
                                  "Italy", 
                                  "Brazil", 
                                  "France", 
                                  "Brazil", 
                                  "Germany", 
                                  "Argentina"], 
                     "Host": ["Qatar", 
                              "Russia", 
                              "Brazil", 
                              "South Africa", 
                              "Germany", 
                              "Korea/Japan", 
                              "France", 
                              "USA", 
                              "Italy", 
                              "Mexico"]
                    })

In [None]:
#another way to create a dataframe from the same data: using a list of lists (rows)
wcup_lists = [[2022, "Argentina", "Qatar"], [2018, "France", "Russia"], [2014, "Germany", "Brazil"]]
wcup2 = pd.DataFrame(wcup_lists, columns = ["Year", "Champion", "Host"])
wcup2.head()

3. In the first example above, we built the dataframe from a Python **dictionary**: each key becomes a column label and each list holds that column's values. In the second example, each inner list becomes one row.

*Dictionaries are beyond the scope of this lesson, but if you are curious, you can learn more about them from [w3schools](https://www.w3schools.com/python/python_dictionaries.asp) or [GeeksforGeeks](https://www.geeksforgeeks.org/python-dictionary/).*

Let's view what the resulting dataframe looks like by typing the variable name we used to save our World Cup dataframe into memory:

We will come back to this dataset below.

## II. Subset a Data Frame

4. Let's take a look at our World Cup data again. Use the **.head()** method to output the first 5 rows of this dataset (wcup):

<code>df_name.head(n)</code><p>declaring <i>n</i> rows, default is 5</p>


5. We can use the **.iloc[..., ...]** indexer to extract particular columns, rows, or cells by their integer position. For example, to retrieve the first row or first column, we would fill in the following pattern:

`df_name.iloc[rowstnum:rowendnum, colstnum:colendnum]` 
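As a minimal sketch of this pattern, the block below applies `.iloc` to a three-row copy of the World Cup data. The name `wcup_small` is hypothetical and used only for this illustration. As with list slicing, the end position of a slice is excluded.

```python
import pandas as pd

# hypothetical three-row version of the World Cup dataframe, for illustration only
wcup_small = pd.DataFrame({"Year": [2022, 2018, 2014],
                           "Champion": ["Argentina", "France", "Germany"],
                           "Host": ["Qatar", "Russia", "Brazil"]})

first_row = wcup_small.iloc[0]          # first row, returned as a Series
first_col = wcup_small.iloc[:, 0]       # all rows, first column ("Year")
top_left = wcup_small.iloc[0:2, 0:2]    # first two rows and first two columns
```

Leaving one side of the comma as a bare `:` selects every row (or every column) along that axis.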

6. How would you retrieve the value found in the 2nd row of the 2nd column?

In [None]:

#remember the 1st item of a list or series in Python is number 0. Thus, the second item is number 1.

7. We can use **.loc[..., ...]** to retrieve values, columns, or rows by their labels. For example, if we wanted to retrieve all info from the "Champion" column, we would fill in the following pattern, using `:` in place of the row names to select every row:

`df_name.loc[rownames, colnames]`
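A minimal sketch of label-based selection, again on a hypothetical three-row dataframe named `wcup_small` (not part of the original notebook). Unlike `.iloc`, a `.loc` slice includes its end label.

```python
import pandas as pd

# hypothetical three-row version of the World Cup dataframe, for illustration only
wcup_small = pd.DataFrame({"Year": [2022, 2018, 2014],
                           "Champion": ["Argentina", "France", "Germany"],
                           "Host": ["Qatar", "Russia", "Brazil"]})

champions = wcup_small.loc[:, "Champion"]              # every row, one column selected by label
subset = wcup_small.loc[0:1, ["Year", "Champion"]]     # rows labeled 0 through 1 (inclusive), two columns
```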

7b. To save a subsetted dataframe we need to assign it to a new name or replace the original df name.

*See Appendix I for more ways to access parts of a dataframe.*

## III. Read a Dataframe

Besides the basic Men's World Cup dataset created above, in this lesson we will also be working with:
 
+ a corpus of **450 novels written in English, French, and German**. This corpus was created by Andrew Piper of the **.txtlab** at McGill University ([documentation for this dataset is available here](https://txtlab.org/2016/01/txtlab450-a-data-set-of-multilingual-novels-for-teaching-and-research/)). This corpus includes separate files for each text and a summary csv with metadata for each.
+ the **Hollywood Film Dialogue dataset** (also used in Walsh. Original data from Hannah Anderson and Matt Daniels, ["Film Dialogue from 2,000 screenplays, Broken down by Gender and Age."](https://pudding.cool/2017/03/film-dialogue/))
<!--+ *one more complex and messy dataset*: a dataset containing **metadata for all New York Times articles for the month of April 2023** (including the title, a summary description, and the first paragraph of each article). This data was downloaded using the [New York Times Archive API](https://developer.nytimes.com/apis).-->


8. To read data in, we will use the Pandas [**read_csv** function](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). First, however, we need to create a path to the folder where our datasets are saved. We will use the **Path** class from the **pathlib** library.

In [None]:
from pathlib import Path
Path.cwd()

In [None]:
# .expanduser() replaces "~" with the full path to the user's home directory
novels_path = Path("~/shared/RR-workshop-data/text_corpora/novels").expanduser() 
list(novels_path.iterdir())

9. Now we can read in the **novels dataset** using the Pandas read_csv function:

`dfname = pd.read_csv(path)`
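A minimal sketch of `pd.read_csv`. Because the name of the metadata file is not given here, this example reads a small made-up CSV from an in-memory string; its rows and column names are hypothetical placeholders. With the real data, you would pass a path built from `novels_path` instead.

```python
import io
import pandas as pd

# made-up CSV text standing in for a real file (rows and column names are hypothetical)
csv_text = "id,language,date\n1,English,1850\n2,French,1862\n3,German,1899\n"

# read_csv accepts a file path or, as here, a file-like object
example_df = pd.read_csv(io.StringIO(csv_text))
print(example_df.shape)   # (number of rows, number of columns)
```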

10. We can print out this dataframe below in Jupyter Notebooks by simply typing the name of the dataframe (note: if you place any code below it, you need to wrap it in a **print()** command).

Here is a quick summary of what you see above from Melanie Walsh's online book, [*Introduction to Cultural Analytics & Python*](https://melaniewalsh.github.io/Intro-Cultural-Analytics/03-Data-Analysis/01-Pandas-Basics-Part1.html):

There are a few important things to note about the DataFrame displayed here:

+ Index
    - The bolded ascending numbers in the very left-hand column of the DataFrame is called the Pandas Index. You can select rows based on the Index.
    - By default, the Index is a sequence of numbers starting with zero. However, you can change the Index to something else, such as one of the columns in your dataset.

+ Truncation
    - The DataFrame is truncated, signaled by the ellipses in the middle ... of every column.
    - The DataFrame is truncated because we set our default display settings to 100 rows. Anything more than 100 rows will be truncated. To display all the rows, we would need to alter Pandas’ default display settings yet again.

+ Rows x Columns
    - Pandas reports how many rows and columns are in this dataset at the bottom of the output (n rows x n columns).
    - This is very useful!

*In addition, I would also add that in the preview of the dataframe above - at least as viewed within JHub - you can click on a pen/highlighter icon to view the whole dataframe.*


## IV. Get summary data to learn more about a dataframe

11. Now, often the first thing we want to do with a new dataset is to explore the size of the dataset and the type and range of data it contains. Run the following functions and then use hashtags ```#``` to add in comments about what each does.

```
dfname.head()
dfname.head(12)
dfname.tail()
dfname.shape
dfname.info()
dfname.sample(10)
dfname.describe()
dfname.columns
dfname['colname'].describe()
dfname['colname'].value_counts()
```



<div class="alert alert-info" role="alert"><h3 style="color:blue;">Exercises for Parts III and IV</h3>

<p style="color:blue;">12. Take a look at the <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">Python Pandas Cheat Sheet</a>. Are there any other summary data you would like to retrieve from this dataset? Try it below:</p></div>

<div class="alert alert-info" role="alert"><p style="color:blue;">13. Now let's open the Film Dialogue dataset. Copy and paste the code from Part III, Numbers 8 and 9 below and then modify that code to open the film dialogue dataset. (Navigate to the film-dialogue folder using the folder directory on the left. You will need to identify the specific file path to and file name of the film dataframe).</p>
</div>

\*\***A note about data entry and humans (and things like gender, race, class, etc.)**: When people collect data, they often have to reduce complex phenomena to individual categories. This is problematic when performed on objects that do not neatly fit into pre-existing categories (or any categories for that matter). This is especially problematic when assigned to humans. How do you reduce humans to a racial or ethnic category when those categories are, to at least some extent, social inventions? When a person comes from a mixed background? How do you reduce people to two gender categories, when many people do not feel they are adequately represented by such categories? Keep this in mind when reviewing datasets that place people into hard and fast categories (as the above dataset does for the gender of movie characters).

<div class="alert alert-info" role="alert" style="color:blue;"><p>Using some of the methods introduced in Part IV and the <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">Python Pandas Cheat Sheet</a>, answer the following questions:</p>
<p>14. How many rows and columns are there in this dataset?</p>
</div>

<div class="alert alert-info" role="alert" style="color:blue;"><p>
15. What is the range of dates for the films included in this dataset?
</p>
</div>

<div class="alert alert-info" role="alert" style="color:blue;">
<p>16. How many characters are identified as men or women (** - see note above) in this dataset?</p>
</div>

<div class="alert alert-info" role="alert" style="color:blue;">
<p>17. What is the average age of film characters in this dataset?</p>
</div>

## V. Subset, Filter, and Sort a Dataframe - an Introduction

Frequently, when we are working with a dataset, we may want to subset, filter, or sort it before beginning our analysis.

Many of you may already be familiar with subsetting, filtering, and sorting datasets. But, just for review, this is what we mean by these terms:
+ **subset** - to subset a dataset is to create a smaller version of the dataset. For example, you may want to drop specific columns or rows to create a dataframe for your analysis

    <img src="https://pandas.pydata.org/docs/_images/03_subset_columns.svg" style="width:40%;">
    <img src="https://pandas.pydata.org/docs/_images/03_subset_rows.svg" style="width:40%;">
    
    + **filter** - filtering a dataframe is a specific type of subsetting. For example, you may want to drop all rows in a dataframe that contain missing values. Or keep only those rows that have a value that meets a particular condition.
+ **sort** - arrange the dataframe in a particular order. For example, you may want to sort a dataframe by a year or time column to arrange each row in chronological order. You may also sort a dataframe by multiple columns at once (i.e. sorting by year first and country second).
    <img src = "https://pythonexamples.org/images/python-pandas-dataframe-sort-by-index.svg" style = "width:40%">

Some examples of how you may want to subset, filter, or sort this lesson's dataframes: 

| Modification | World Cup dataset | novels dataset | Film Dialogue dataset |
| :- | :- | :- | :- |
| **subset / filter** | create a df including only the 21st century men's World Cup champions | keep only novels written in English **or** during a set range of years **or** in the first person | keep only "title", "gender", and "proportion_of_dialogue" columns **or** only include data for film dialogue spoken by a character over 50 years old  |
| **sort** | sort dataframe by year, but this time in ascending order | sort dataframe by publication date and word length | sort df by proportion of dialogue |



## VI. Filtering a dataframe by a condition

18. To filter a dataframe by a condition, the general procedure is:

`filt_df = df.loc[filter]`

First we need to define the filter itself. For example, a filter searching for a particular value in one column would follow this format:

`filt = (df[colname] == 'value')`

19. We can then assign this specific filter to a variable...

20. And use that filter to create a subset of our original dataset
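Steps 18 to 20 can be sketched as follows. This is a minimal illustration on a hypothetical four-row dataframe (`wcup_small` and `france_filt` are names invented for this example).

```python
import pandas as pd

# hypothetical four-row dataframe, for illustration only
wcup_small = pd.DataFrame({"Year": [2022, 2018, 2014, 1998],
                           "Champion": ["Argentina", "France", "Germany", "France"]})

# the filter is a Series of True/False values, one per row
france_filt = (wcup_small["Champion"] == "France")

# .loc keeps only the rows where the filter is True
france_df = wcup_small.loc[france_filt]
```

The filter itself holds no data from the dataframe, only one True or False per row, which is why the same filter can be reused or combined with others.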

21. We can also filter by multiple conditions:

In [None]:
years_filt = (novels_df['date'] >= 1890) & (novels_df['date'] < 1900)
novels1890s = novels_df.loc[years_filt]     #note: you cannot begin a variable name with a number, thus 90s_films would raise an error
print(novels1890s.head())
novels1890s.tail()

22. We can also combine multiple filters to create an even smaller and more specific subset:

In [None]:
en_filt = (novels_df['language'] == "English")
novels_en1890s = novels_df.loc[years_filt & en_filt]
novels_en1890s

<div class="alert alert-info" role="alert"><h3 style="color:blue;">Exercise for Part VI. Subsets and Filters</h3>

<p style="color:blue;">23. Filter and subset the film_df to create a new dataset with only films from <b>a decade of your choosing</b>.</p>
</div>

<div class="alert alert-info" role="alert"><h3 style="color:blue;">Exercise for Part VI: Subsetting</h3>

<p style="color:blue;">24. Now, filter the Films dataframe by two different columns of your choosing (e.g. years between 1990 and 2000 and gross revenue greater than $500 million).</p>
</div>

## VII. Sorting a dataframe

25. We can sort a dataframe by the values in one column or in multiple columns using the **.sort_values()** method.

In [None]:
#novels_df.sort_values(by = 'date')
#novels_df.sort_values(by = 'date', ascending = False)
novels_df.sort_values(by = ['date', 'language', 'length'], ascending = [True, True, False])

<div class="alert alert-info" role="alert"><h3 style="color:blue;">Exercise for Part VII. Sorting</h3>

<p style="color:blue;">26. Sort the films dataframe by two columns of your choosing</p></div>

## VII. Split-Apply-Combine

<img src = "https://www.oreilly.com/api/v2/epubs/9781783985128/files/graphics/5128OS_09_01.jpg" style="width:40%">

A commonly used data analysis strategy is **split-apply-combine**:
+ **split** the problem / data into manageable pieces
+ **apply** some calculations or analysis to the pieces
+ **combine** the parts back into a whole

For computational data science this means (from [Python Pandas documentation](https://pandas.pydata.org/docs/user_guide/groupby.html)):
```
    * Splitting the data into groups based on some criteria
    * Applying a function to each group independently
    * Combining the results into a data structure
```

Let's examine how this strategy may work for the datasets we have been working with. In our novels dataset, for example, we may want to compare novels by language. So we could:
+ **split** the overall dataset by language, creating separate English, French, and German groups.
+ **apply** some calculations to each group; e.g. What is the average length of the novels in each language? What is the earliest or latest publication date?
+ **combine** the results back together to compare these values across languages.

Different applications apply this strategy in different ways. In Excel, you would use pivot tables; in SQL, the "group by" operator; and in R, you may use the plyr package.

With Pandas, we can use the **.groupby()**, **.agg()**, and **.apply()** functions to do this. 
+ **.groupby()** - *splits* a dataset into separate groups
+ **.agg()** - used to calculate multiple statistics per group in one calculation
+ **.apply()** - applies a function across an entire column (or row) in a Pandas dataframe 

Let's re-examine the contents of our novels dataframe. 

27. Use the .head() method to output its first five rows and use the .columns attribute to list its column names.

### VIIb: Groupby()

*The first few cells of this section are adapted from Constellate's [3rd Pandas tutorial](https://lab.constellate.org/practical-diabetes-methods/notebooks/tdm-notebooks-2023-04-20T12%3A35%3A07.477Z/pandas-3.ipynb).*

28. Groupby is a powerful function built into Pandas that you can use to summarize your data. Groupby splits the data into different groups on a variable of your choice. 

For example, you can type: `dfname.groupby(['col1', 'col2'])` followed by `.groups` or `.size()` to get summary info about these groups.

To learn more about using Pandas **.groupby()** to split-apply-combine data, see the [documentation here](https://pandas.pydata.org/docs/user_guide/groupby.html).
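A minimal sketch of split-apply-combine with `.groupby()`. The dataframe `toy` is hypothetical: its column names mirror those of the novels dataframe, but the rows are invented for illustration.

```python
import pandas as pd

# hypothetical toy table; column names mirror the novels dataframe, rows are invented
toy = pd.DataFrame({"language": ["English", "English", "French", "German"],
                    "gender": ["female", "male", "female", "male"],
                    "length": [100, 200, 300, 400]})

grouped = toy.groupby("language")          # split: one group per language
sizes = grouped.size()                     # number of rows in each group
mean_length = grouped["length"].mean()     # apply a calculation per group, then combine the results
print(mean_length)
```

The result of the last step is a Series indexed by language, so individual groups can be looked up by label (for example `mean_length["English"]`).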

### VIIc. agg()

We can then use the **.agg()** (aggregate) method to choose what functions we will **apply** to each group to allow us to **combine** the data back together.

The general formula for applying functions using the aggregate (.agg()) method:

```
df_name.groupby('col_name').agg({dict assigning a function to each column we want to aggregate})

```
where the format for each dictionary is as follows:
```
{'name_of_col1_to_keep': 'function_to_apply_to_this_col', 'name_of_col2_to_keep': 'function_to_apply_to_this_col'}
```

Some commonly used functions with groupby ([click here](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#computations-descriptive-stats) for a full list):
+ .count(), .mean(), .min(), .max(), .sum()

31. With this in mind, examine the following line of code. What do you guess it does? Run it and see if you were correct.

In [None]:
novels_df.groupby('language').agg({'length':'mean'})

32. We can also group by more than one column before aggregating (and, by extending the dictionary, .agg() can apply multiple functions to multiple columns):

In [None]:
novels_df.groupby(['language', 'gender']).agg({'length':'mean'})
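To apply several functions to several columns at once, each dictionary value can be a list of function names. The sketch below assumes a hypothetical dataframe `toy` with invented rows; only the column names are borrowed from the novels dataframe.

```python
import pandas as pd

# hypothetical toy table; column names mirror the novels dataframe, rows are invented
toy = pd.DataFrame({"language": ["English", "English", "French", "German"],
                    "date": [1850, 1870, 1860, 1890],
                    "length": [100, 200, 300, 400]})

# each key is a column to aggregate; each value lists the functions to apply to it
summary = toy.groupby("language").agg({"length": ["mean", "max"],
                                       "date": ["min"]})
print(summary)
```

The resulting columns have two levels (column name, then function name), so a single value is retrieved with a pair such as `summary.loc["English", ("length", "mean")]`.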

### VIId. .apply()
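This section is left open in the notebook. As a minimal sketch under that caveat, `.apply()` called on a single column runs a function on every value in that column and returns the results as a new Series. The dataframe `toy`, its rows, and the helper `length_in_thousands` are all hypothetical.

```python
import pandas as pd

# hypothetical toy table, for illustration only
toy = pd.DataFrame({"title": ["a short title", "another, longer title"],
                    "length": [1000, 2500]})

# apply an anonymous (lambda) function to every value in the "title" column
toy["title_words"] = toy["title"].apply(lambda t: len(t.split()))

# apply a named function to every value in the "length" column
def length_in_thousands(n):
    return n / 1000

toy["length_k"] = toy["length"].apply(length_in_thousands)
```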

## VIII. Creating Basic Visualizations

We can create simple data visualizations using the **Seaborn** library (which is built on the **matplotlib** library).

In [None]:
import seaborn as sns
sns.barplot(novels_df,  x = "length", y = "language", hue = 'gender')

In [None]:
sns.barplot(novels_df, x = "language", y = "length", hue = "person")

In [None]:
import seaborn.objects as so

so.Plot(novels_df, x = "language", color = "person").add(so.Bar(), so.Count(), so.Stack())

In [None]:
sns.swarmplot(novels_df, x = "date", y = "language", hue = "gender") #"person"

<div class="alert alert-info" role="alert" style="color:blue;"><h3>Exercises Part VII and VIII: Split-Apply-Combine</h3>
<p>33. Now, using what we've learned in this lesson, let's answer the question: What films from each decade had the highest gross while having a female character who had the most dialogue in the film?</p> 

<p> Note: the methods we learned in this session may not be sufficient. Feel free to search online for additional help. The key to effective searches is using the correct words to describe what you are trying to do, such as "python dataframe subset max value from each group".</p>

<p><i>Alternatively, you can try the same with a character over 60 years old, for example.</i></p>

<p>First, identify the steps we will need to do to answer this question in the text cell below:</p>


</div>

Steps:
1. ...
2. 

## IX. Reading multiple text files into a dataframe (if time allows or save for next session)

34. We already experimented with two common ways to create a dataframe: creating one from scratch and reading a csv into a dataframe.

For text analysis, however, we often work with collections of texts stored in plain text files (.txt). Sometimes it is sufficient to loop through and open all plain text files every time we need to process or analyze them.

However, if we plan to return to the same corpus or collection of texts regularly, it usually makes sense to store the full texts and various processed versions of those same texts (e.g. divided into lists of words and punctuation known as "tokens") in a dataframe. The code below iterates through 200+ U.S. Presidential State of the Union addresses and creates a dataframe storing the president, year, number of tokens, full text, and a list of tokens for each speech.

In [None]:
from pathlib import Path   # Path.glob() is used below, so the separate glob module is not needed
sotudir = Path("~/shared/RR-workshop-data/state-of-the-union-dataset/txt").expanduser() 
pathlist = sorted(sotudir.glob('*.txt'))
txtlist = []
for path in pathlist:
    nameparts = path.stem.split("_")
    pres = nameparts[0]
    year = nameparts[1]
    with open(path, encoding = 'utf-8') as f:
        txt = f.read()
    tokens = txt.split()
    numtoks = len(tokens)
    txtlist.append([pres, year, numtoks, txt, tokens])
colnames=['pres','year','numtoks','fulltext', 'tokens']
sotudf=pd.DataFrame(txtlist, columns=colnames)
sotudf = sotudf.sort_values(by = "year")
sotudf.head()


## Appendix I: Other ways to access parts of a dataframe

8. At the moment, rows are only indexed by numbers ("0", "1", "2" and so on). However, we can convert our "Year" column into an index:

`newdf = df.set_index("colname")`

In [None]:
wcup = wcup.set_index("Year")


9. Output the head of this dataframe (`df.head()`). Notice the slight change (The lower placement of "Year" indicates it is now an index rather than a column):

In [None]:
wcup.head(3)

10. Now we can search the dataframe by our index ("Year") and by column name. For example:

In [None]:
wcup.loc[1994, "Champion"]

In [None]:
wcup.loc[2022:2010, 'Champion']

*Note: for the code above, our index ("Year") holds integers, not strings. So running:*

```
wcup.loc["1994", "Champion"]
```

*will produce an error. Since 1994 is considered an integer here, we should leave the quotes out.*

11. Like many things in Python, there are often multiple ways to accomplish the same goal. For example, we can also select particular columns by simply placing the names of a column within brackets (similar to indexing and slicing lists in Python).

In [None]:
# Use bracket notation to access the column 'Champion'
wcup['Champion']

## Appendix II: Subsetting Dataframes by dropping specific columns or rows

14. We can drop specific columns using the **.drop()** method for Pandas dataframes. For example, if we want to drop the "filename" and "id" columns from our novels dataset, we could run the following:

In [None]:
print(novels_df.shape)
print(novels_df.columns)
novels_df2 = novels_df.drop(columns = ['filename', 'id'])  #note: we assign the result to "novels_df2" to save the smaller dataframe under a new name
                                    ##you could also just replace the original "novels_df" by writing novels_df = novels_df.drop(...)
                                    ## however, often it is helpful to keep both the original, full dataframe and the smaller, subsetted one
print(novels_df2.shape)
print(novels_df2.columns)

There are many different ways to drop specific columns or rows using .drop(). See the [Pandas 2 Constellate lesson](https://lab.constellate.org/monist-language/notebooks/tdm-notebooks-2023-04-19T18%3A59%3A16.317Z/pandas-2.ipynb) for some additional examples.

15. Commonly, for example, we may want to drop observations (rows) containing null data (warning: you should always consider what this removal does to the representative nature of your dataset!). We can use the **.dropna()** method. See some examples below:

In [None]:
print(film_df.shape)
film_df_no_nas = film_df.dropna()
print(film_df_no_nas.shape)

15b. Temporarily or permanently changing a dataframe: Most Pandas methods return a modified copy and leave the original dataframe unchanged unless you assign the result to a variable. For example:

```
film_df.dropna()
```
just outputs a version of film_df with all NAs and null values removed, but does not save it to memory. If we wanted to replace the original "film_df" we can simply assign it the same name as follows:

```
film_df = film_df.dropna()
```
Or we can save it under a new name as we did above:
```
film_df_no_nas = film_df.dropna()
```

Finally, we can also use the "inplace" argument to do the same:

```
film_df.dropna(inplace = True)
```
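The three options above can be compared side by side. This is a minimal sketch on a hypothetical dataframe `toy` with one missing age; the rows are invented for illustration.

```python
import pandas as pd

# hypothetical toy table with one missing value, for illustration only
toy = pd.DataFrame({"character": ["a", "b", "c"],
                    "age": [34, float("nan"), 61]})

toy.dropna()                  # returns a new dataframe; toy itself is unchanged
rows_before = len(toy)        # the original still has all of its rows

cleaned = toy.dropna()        # save the result under a new name
toy.dropna(inplace = True)    # modify toy directly (this call returns None)
rows_after = len(toy)
```

Because `inplace = True` returns `None`, avoid writing `toy = toy.dropna(inplace = True)`, which would replace the dataframe with `None`.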



15c. You can also drop columns with missing values (this is much rarer than dropping rows with missing values):

In [None]:
film_df_dropcols = film_df.dropna(axis = 1)    #for Pandas, rows are axis 0 and columns are axis 1
#film_df_dropcols = film_df.dropna(axis = "columns") # this does the same thing! 

15d. More commonly, you may want to remove rows that are missing values for specific columns. For example, if you want to analyze the age of characters / actors given speaking roles in films, you would want to remove any characters (rows) for which we lack age data.

In [None]:
print(film_df.shape)
film_df_age = film_df.dropna(subset = ["age"])   # subset takes a list of the column names to check for missing values
film_df_age