# Denison CS-181/DA-210 Homework

---


# Mid-Semester Project: Movies and Actors

## Preliminaries

### Objectives

The objectives of this project include:

- Working with a real-world data set, and one that has greater volume and scale than we have seen,
- Going beyond the "building-block" mentality that comes with the small data and limited scope of homework sets,
- Building a larger whole, synthesized from many different skills and analysis learned so far over the semester,
- Thinking about the data itself, and the information the study of that data can provide, and
- Effectively communicating what was learned.

A secondary objective is to give students a "preliminary run" for a synthesis project like this, so as to be better prepared for the final project at the end of the semester.

This preliminary run will give more structure and guidance than you will see specified in the final project.  But even within this project, you will have some self-determination to explore the data set and use it to ask questions, entailed by the data, that are interesting to you.

It is *vitally important* that you realize early on that this project will be evaluated on **how** you do your work, and on **how well** you communicate.  A project that just hacks together some monolithic or ill-structured code to "get an answer", or does a poor job at writing the essay that communicates what is learned will not receive a good grade.

### The Data

The data for this project comes from The Movie Database (TMDb), https://www.themoviedb.org.  This provider has an API that we will explore in more detail in the last four weeks of the class, but for now we have used this API to acquire data about the "most popular" movies since 2010.  The popularity metric is one invented and maintained by TMDb, and it incorporates a number of factors to produce a scalar (float) popularity value.

Given the popular movies, we also used the API to obtain information about the actresses and actors that are billed as the top six actors for each movie.  We then acquired some additional data about these actors, to get things like the actor's popularity and other actor-specific data.

We have translated the acquired data into an XML format, and this will serve as your starting point for this project.

Below, we will subdivide our guidance regarding processing and information content based on two categories of XML source data files:

1. There is one XML file about the movies,
2. There is a set of ten XML files about the actors and the movies (from the above set) they have appeared in.

The XML data is all in the `moviedata` folder in the class repository.  Thus, by calling `resolve_dir()` from the `util` module, you can set up a `datadir` variable with the path to the folder containing your data.  After this description, we include the normal prologue cell to get the `util` module into your environment.
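To make the idea of resolving a data folder concrete, here is a minimal sketch of an upward directory search. It is **not** the course `util` implementation; the name `find_data_dir` and its `start` and `max_levels` parameters are hypothetical stand-ins chosen for illustration.

```python
import os

def find_data_dir(name, start=".", max_levels=5):
    """
    Hypothetical stand-in for a folder resolver: look for a folder
    called `name` in `start`, then in each ancestor of `start`.

    Return: the absolute path of the folder, or None if not found
            within `max_levels` steps up the tree.
    """
    directory = os.path.abspath(start)
    for _ in range(max_levels + 1):
        candidate = os.path.join(directory, name)
        if os.path.isdir(candidate):
            return candidate
        parent = os.path.dirname(directory)
        if parent == directory:  # reached the file system root
            break
        directory = parent
    return None
```

Because the search starts from the notebook's location rather than from a hard-coded path, a notebook written this way keeps working when someone else runs it from their own copy of the repository.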

### The Process

Modeling the process that would occur in a data analysis pipeline, your overall phases in this project will include:

1. Use the XML data source to **understand the data**, which, as we know, means to determine what are values and what are variables, which variables are independent and which are dependent, and what set of **functional dependencies** is entailed by the data.

2. **Build functions** and process the data with a goal of creating a **set** of tidy data tables for the data.  The functions will use our XML and XPath operations, as well as manipulating the data frames in `pandas` to obtain the desired tidy data.  Once we have tidy data, we will write out a set of CSV files that can then serve as an output product of these first two phases.

3. Within our Python processing, we will also do some preliminary exploration of the data, asking you to perform operations that give you a sense of the data, and to build a basic histogram and a scatter plot to help understand the data distribution.

4. You will craft two or three **interesting questions** that deeper exploration of the data can help answer, and design visualizations that can present the data in a manner that helps answer those questions.

5. Communicate, through an essay written for a **non-expert audience** that sets the data context, develops the questions, presents the visualizations, interprets what is learned from the data and visualizations, and concludes.

We will perform Phases 1 and 2 twice, once for the movies XML source, and then again for the set of actor-movie XML source files.  These steps are part of the current notebook.  Phase 3 is detailed in a separate notebook, `movie_explore.ipynb`, which you will also turn in.

Phases 4 and 5 are the self-determined part of this project.  We will describe the requirements here, but you perform these phases independently of this notebook, and will use Tableau or another mechanism for creating your visualizations, and then will produce your essay and turn in a PDF of the final product.  In getting to that final product, you may choose to use your own Python notebook that consists primarily of Markdown (I would not expect code in such an essay notebook), but you could use Google docs, Microsoft Word, LaTeX, or any other document creation environment.  You turn in a PDF.

**Summarizing the deliverables**

1. The first deliverable is the current notebook, which encompasses phases 1 and 2.  It should be executable by me and by our TAs, and should not be dependent on your own file system and its paths.
2. The set of CSV files corresponding to the tidy data result of your processing in this notebook.
3. The `movie_explore.ipynb` is another notebook deliverable.
4. A PDF of your essay communicating what you learned in the project.

### Grading

The project will be graded out of 50 points, split evenly with 25 points for assessment of this notebook (and its output), and 25 points for assessment of your essay and the work entailed.

#### Assessment of the Notebooks

- Correctness of the Functional Dependencies determined for the source data,
- Correctness of code,
- Appropriate design and creation of functions, including use of appropriate parameters,
- **Use** of those functions in higher level functions,
- Avoidance of global code or global variables outside of functions,
- Answers in markdown and code solution cells,
- Code documentation:
    - docstrings
    - inline comments
- Correct output (the CSVs)

#### Assessment of the Essay

- Presence of two or three interesting questions that have sufficient depth and can be answered by the data,
- **Good** visualizations that help understand the data and answer the questions.
    - Must avoid being too busy, or in other ways inappropriate to what can be well conveyed in a figure in an essay,
    - Must be appropriately titled, with axis labels and clear units,
    - Must allow the reader to compare data and interpret results for themselves.
- Essay must be well written, including:
    - Good use of headers and structures, giving it a clear organization,
    - Good grammar, punctuation, use of terms,
    - Set a background/context/introduction,
    - **Develop** the questions to be answered,
    - Present the visualizations and describe what the reader is seeing and help **interpret** the results,
    - Hit the target audience of a non-expert.

### Transition to Development Phases

> The remainder of this notebook helps guide you through the first two phases.  As noted above, we will progress through:
> - Phase 1 and 2 for the `movies.xml` source,
> - Phase 1 and 2 for the set of `actormovie` XML files,
> 
> Phase 3 on basic exploration will be part of a separate notebook.
>
> Phases 4 and 5 are on your own, but should adhere to the requirements set forth above.

In [5]:
import os
import sys
import lxml
import pandas as pd
from lxml import etree

def add_modules():
    """
    Starting at the current directory and proceeding up the file system
    tree, search for a directory named `modules`.  If found, and if not
    already there, add to the Python module search path.
    
    Params: None
    
    Return: None
    """
    directory = "."
    levels = 0
    while not os.path.isdir(os.path.join(directory, "modules")) and \
          levels < 5:
        directory = os.path.join(directory, "..")
        levels += 1
    module_path = os.path.abspath(os.path.join(directory, "modules"))
    if os.path.isdir(module_path):
        if not module_path in sys.path:
            sys.path.append(module_path)

add_modules()
import util

## Movies Development

> Note that in the following development, we are expecting you to do more for yourself.  So the prologue cell includes the basic imports, but does not include anything further.

### Acquiring the XML & Understanding the Data

**Step 1** In the following cell, define a function to obtain the XML tree from `movies.xml` in the `moviedata` folder.  Then **use that function** to actually obtain the tree and assign to `mroot`.

If you are unsure about what makes a good functional abstraction here, you can use the answer to your question in homework 3.2.  But in either case, reflect on the design of the function and its parameters: why were *these* chosen as parameters, and why would we want a function to perform these steps (as opposed to just creating the tree)?  Put the answers to these last questions in the markdown following the code cells.

In [6]:
# Solution cell

datadir = util.resolve_dir("moviedata")
parser0 = etree.XMLParser(remove_blank_text=True)

def movieDataXML(filename, parser=None):
    '''
    Parse an xml file from the data folder and return the
    root of its tree.
    
    Parameters: filename: the name of the xml file, looked up
                in the module-level `datadir` folder
                parser: the parser used on the file;
                if None, a standard etree.XMLParser
                is used
                
    Return: root: the root of the xml file
            None: if there was not a file with that filename
            or the parser did not work on the file
    '''
    path = os.path.join(datadir, filename)
    if not os.path.isfile(path):
        return None
    if parser is None:
        parser = etree.XMLParser()
    try:
        tree = etree.parse(path, parser)
    except (OSError, etree.XMLSyntaxError):
        # Unreadable or malformed file: report it like a missing file
        return None
    return tree.getroot()

mroot = movieDataXML("movies.xml", parser0)

In [7]:
assert True

The function takes the filename and the parser as parameters, but no directory, because it expects the data to come from the `moviedata` folder by default. Additionally, the `try` block and the file-existence check are there so that the function returns `None` instead of failing when the file is missing or cannot be parsed.

**Step 2** Use the following code cell to gather information from the XML to help start understanding the data.  You are welcome to use as many code cells as you wish to demonstrate this discovery.

These steps could use XML procedural operations and XPath to find results.  Some exploratory steps and questions might include:

- Printing a prefix of the tree to see the top level structure,
- Finding the number of movies,
- Determining the attributes and the set of children of a movie Element,
    - do all movie elements have all of the same attributes and/or children?
- Are any attributes or children problematic when we think about tidy data?

You may use the above, but also come up with your own.  Then, in the markdown cell following the code cell, describe what you learned about the data.  You should even write down a preliminary functional dependency (or two), depending on what you discovered.
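The exploratory questions above can be sketched on a tiny, made-up document. The XML below is invented for illustration and only mimics the shape of the movies file; the variable names are likewise hypothetical.

```python
from lxml import etree

# A tiny, invented document shaped like the movies file.
SAMPLE = b"""<movies>
  <movie id="1" title="First">
    <genre>Action</genre><genre>Comedy</genre>
    <popularity>10.5</popularity>
  </movie>
  <movie id="2" title="Second">
    <genre>Drama</genre>
    <popularity>3.0</popularity>
  </movie>
</movies>"""

root = etree.fromstring(SAMPLE)
movie_elements = root.xpath("/movies/movie")

# How many movie elements are there?
movie_count = len(movie_elements)

# Which attribute names and child tags appear on each movie?
# One distinct set means every movie has the same structure.
attribute_sets = {frozenset(m.attrib.keys()) for m in movie_elements}
child_tag_sets = {frozenset(c.tag for c in m) for m in movie_elements}

# Tags that repeat inside one movie are multi-valued, which is
# a warning sign for a one-row-per-movie tidy table.
repeated_tags = set()
for m in movie_elements:
    tags = [c.tag for c in m]
    repeated_tags.update(t for t in tags if tags.count(t) > 1)

print(movie_count, attribute_sets, child_tag_sets, repeated_tags)
```

If `attribute_sets` or `child_tag_sets` contained more than one set, some movies would be missing an attribute or child, and the table-building code would need to handle that case.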

In [8]:
# Solution cell

prefix = mroot.xpath('/movies')
prefix

[<Element movies at 0x7fa757bc8380>]

In [9]:
movies = len(mroot.xpath('/movies/movie'))
movies

64

In [10]:
attrib = mroot.xpath('/movies/movie[1]/@*')
attrib

['527774', 'Raya and the Last Dragon']

In [11]:
children = mroot.xpath('/movies/movie[1]//text()')
children

['Animation',
 'Adventure',
 'Fantasy',
 'Family',
 'Action',
 'Long ago, in the fantasy world of Kumandra, humans and dragons lived together in harmony. But when an evil force threatened the land, the dragons sacrificed themselves to save humanity. Now, 500 years later, that same evil has returned and it’s up to a lone warrior, Raya, to track down the legendary last dragon to restore the fractured land and its divided people.',
 '5726.502',
 '2021-03-03',
 '8.5',
 '1154']

In [12]:
amount_genre = len(mroot.xpath('/movies/movie/genre/..'))
amount_genre

64

In [13]:
amount_des = len(mroot.xpath('/movies/movie/overview/..'))
amount_des

64

There are 64 movies, and each of them has at least one genre, an overview, a release date, a popularity, a vote count, and a vote average. They also have two attributes, an id and a title. Some films have more than one genre, which will be problematic when it comes to tidy data.

**Step 3** There are a number of numeric values within the data set.  Part of understanding the data is, for numeric data, to know the minimum, maximum, and mean for a set of numbers.  Since we want to avoid repeated and/or copy and paste code, let us write a function to calculate these values for a given child of movie.  Write a function:

    childStats(root, child, conversion=float)
    
where `root` is the root of a movies tree, child is the tag for a particular child, and conversion determines what conversion should be performed on the string/textual data, with a conversion to a float value being the default.  The function obtains a list of the text values for the specified child, converts the list to numeric values using the specified conversion, and then calculates and returns the min, mean, and max from the set.

After you have defined the function, invoke the function appropriately on each of the numeric children.  In the markdown cell that follows, describe what you learned.
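As one possible reading of this specification, the sketch below takes the child **tag** and returns the values in the order the description lists them (min, mean, max). The name `child_stats_by_tag` and the sample XML are assumptions made for illustration, not part of the assignment.

```python
from lxml import etree

# Invented sample with values chosen so the mean is easy to check.
SAMPLE = b"""<movies>
  <movie id="1"><popularity>1.5</popularity><vote_count>10</vote_count></movie>
  <movie id="2"><popularity>2.5</popularity><vote_count>20</vote_count></movie>
  <movie id="3"><popularity>5.0</popularity><vote_count>60</vote_count></movie>
</movies>"""

def child_stats_by_tag(root, child, conversion=float):
    """
    Return (min, mean, max) of the converted text of every
    `child` element found directly under a movie element.
    """
    texts = root.xpath("/movies/movie/" + child + "/text()")
    values = [conversion(text) for text in texts]
    return min(values), sum(values) / len(values), max(values)

root = etree.fromstring(SAMPLE)
popularity_stats = child_stats_by_tag(root, "popularity")
vote_count_stats = child_stats_by_tag(root, "vote_count", conversion=int)
print(popularity_stats, vote_count_stats)
```

Passing `conversion=int` for counts keeps the minimum and maximum as integers, while the mean is still a float because of the division.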

In [14]:
def childStats(root, child, conversion=float):
    '''
    This function gives you the min, max and mean of
    the numeric values selected from an xml tree.
    
    Parameters: root: the root of the tree
                child: the xpath to the text of the child
                conversion: the conversion applied to each
                text value; the default is float
                
    Return: min of the child, max of the child, and
            mean of the child.
    '''
    child = root.xpath(child)
    child = [conversion(val) for val in child]
    return min(child), max(child), sum(child)/len(child)

popularity = childStats(mroot, "/movies/movie/popularity/text()")
vote_count = childStats(mroot, "/movies/movie/vote_count/text()")
vote_average = childStats(mroot, "/movies/movie/vote_average/text()")

print(popularity, vote_count, vote_average)

(282.24, 5726.502, 750.7537812500002) (0.0, 21223.0, 2580.546875) (0.0, 9.6, 6.739062500000001)


In [15]:
assert True

We learned how to convert a string into a different data type (in this case a float) with a conversion parameter. We also learned how to get the minimum, maximum, and mean of a list and return all of them at the same time by putting them into a tuple.

### Building Tabular Data from Movies XML

XML is meant to provide flexible organization and to transmit data. While it is well suited to that job, it does not by itself support the kind of queries and analysis, like groupby and aggregation, that we want to perform. Thus, we often need to traverse the XML file and build two-dimensional subsets of the data into pandas data frames in order to do any further analysis.

#### Movie Table

Including the set of genres in a movies table that is supposed to be tidy is problematic.  With genres, we would have a **set** or collection of values that would go inside a table entry for a particular movie.  Then we are violating the idea of a column being exactly one variable, since this column would have a value that is really some kind of multivalue, which is an extension of the idea of a mashup.  We need the value of exactly one variable to be exactly one value.  

We will address this a few steps hence.  For now we focus on the tidy portion of a table that has a row per movie, and columns for each of the dependent variables, omitting genres.

The best way to solve a problem like this is to break it into more manageable pieces.  We will first define a function that can process a single XML `movie` Element and yield exactly one row, represented as either a list or a dictionary. (We recommend representing a row as a dictionary, like we did in class and in the textbook when processing XML.)

If we have successfully defined such a function, we can then put it into a higher level function that can use it repeatedly on the set of `movie` Elements in the tree.  This builds a collection of rows that we can then use to create a pandas data frame.

**Step 1** Decide on the columns (and column names) for your movies table.  Also determine the appropriate data type for each column.  With your design clearly in mind, write a function:

    genMovieRow(movieElement)
    
that constructs and returns a single row of data about an individual movie based on a movieElement.  The type of the return result should be either a list or a dictionary.  In the dictionary case, the keys should be your column names and they should map to the values.  Regardless of whether you return a list or a dictionary, the data types of the values should be correct.

In the end of the solution cell, we show an example invocation of your function on the first movie in the collection.  You should test your function on additional movie elements from the movies tree.
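The requirement that values carry the correct data types can be met by pairing each child tag with a conversion. The sketch below is one way to do that under stated assumptions: the `CONVERTERS` mapping, the function name `typed_movie_row`, and the sample XML are all hypothetical, and every child is assumed to be present.

```python
from lxml import etree

# Invented one-movie sample shaped like the movies file.
SAMPLE = b"""<movies>
  <movie id="1" title="First">
    <genre>Action</genre>
    <overview>Plot.</overview>
    <popularity>10.5</popularity>
    <release_date>2021-03-03</release_date>
    <vote_average>8.5</vote_average>
    <vote_count>1154</vote_count>
  </movie>
</movies>"""

# Assumed column design: child tag -> conversion for its text.
CONVERTERS = {
    "overview": str,
    "popularity": float,
    "release_date": str,   # left as text; could be parsed to a date later
    "vote_average": float,
    "vote_count": int,
}

def typed_movie_row(movie_element):
    """Build one row (a dict) with converted values, omitting genres."""
    # The id is kept as text because no arithmetic is done on it.
    row = {"movieid": movie_element.get("id"),
           "title": movie_element.get("title")}
    for tag, convert in CONVERTERS.items():
        row[tag] = convert(movie_element.findtext(tag))
    return row

root = etree.fromstring(SAMPLE)
row = typed_movie_row(root.xpath("/movies/movie[1]")[0])
print(row)
```

Keeping the conversions in one mapping means a change to the column design touches a single place rather than the body of the loop.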

In [16]:
def genMovieRow(movieElement):
    '''
    This function gets the row of a movie element in the 
    xml file "movies.xml", excluding the genres.
    
    Parameters: movieElement: the movie element that will have
                its row returned in a dictionary.
    
    Return: D: the dictionary with the row of the movie element
    '''
    D = {'title':movieElement.attrib['title'], 'movieid':movieElement.attrib['id']}
    for child in ['overview', 'popularity', 'release_date', 'vote_average', 'vote_count']:
        value = movieElement.find(child).text
        D[child] = value
    return D
    
firstmovie = mroot.xpath("//movies/movie[1]")[0]
row = genMovieRow(firstmovie)
print(row)

{'title': 'Raya and the Last Dragon', 'movieid': '527774', 'overview': 'Long ago, in the fantasy world of Kumandra, humans and dragons lived together in harmony. But when an evil force threatened the land, the dragons sacrificed themselves to save humanity. Now, 500 years later, that same evil has returned and it’s up to a lone warrior, Raya, to track down the legendary last dragon to restore the fractured land and its divided people.', 'popularity': '5726.502', 'release_date': '2021-03-03', 'vote_average': '8.5', 'vote_count': '1154'}


In [17]:
assert True

**Step 2** Using the function from the previous question, write a function

    genMoviesTable(movies_root)
    
that iterates over the movie elements, accumulating a list of the rows for the table.  Your function should then create a valid data frame representing the data.  The column Index of the returned data frame should have meaningful names.

In [18]:
def genMoviesTable(movies_root):
    '''
    This function will return all of the movie rows as a 
    DataFrame in the file "movies.xml", excluding the genres.
    
    Parameters: movies_root: the root of the xml file.
    
    Return: table: a DataFrame of the different movies in the
            xml file, without the genres.
    '''
    collist = movies_root.xpath('/movies/movie')
    L = []
    for movie in collist:
        L.append(genMovieRow(movie))
    table = pd.DataFrame(L)
    return table
    

movies = genMoviesTable(mroot)
movies.head(10)

Unnamed: 0,title,movieid,overview,popularity,release_date,vote_average,vote_count
0,Raya and the Last Dragon,527774,"Long ago, in the fantasy world of Kumandra, hu...",5726.502,2021-03-03,8.5,1154
1,Tom & Jerry,587807,Tom the cat and Jerry the mouse get kicked out...,3172.173,2021-02-11,7.6,814
2,Coming 2 America,484718,Prince Akeem Joffer is set to become King of Z...,2504.228,2021-03-05,7.1,775
3,Monster Hunter,458576,A portal transports Cpt. Artemis and an elite ...,2333.41,2020-12-03,7.3,1108
4,Wonder Woman 1984,464052,A botched store robbery places Wonder Woman in...,2010.4,2020-12-16,6.9,4229
5,The Little Things,602269,"Deputy Sheriff Joe ""Deke"" Deacon joins forces ...",1229.877,2021-01-28,6.5,562
6,Outside the Wire,775996,"In the near future, a drone pilot is sent into...",1189.271,2021-01-15,6.5,840
7,Wrong Turn,630586,Jen and a group of friends set out to hike the...,1140.524,2021-01-26,6.4,238
8,Black Water: Abyss,522444,An adventure-loving couple convince their frie...,1119.721,2020-07-09,5.1,156
9,Breach,651571,A hardened mechanic must stay awake and mainta...,1022.12,2020-12-17,4.5,297


In [19]:
assert True

#### Genres of Movies

When we have a case of multiple values for a single variable, like we do for genres, we can solve our problem by creating a separate and distinct table that represents the pairings between a movie and its genres.  In such a table there would be a row for each **combination** of movie and its genre.    We will call this our `movie_genre` table.

There will be two columns in this table: movie id, and genre name.  This means that processing a single movie Element from our XML would generate a **list** of rows for `movie_genre`.

We should be able to break our overall problem by extending our two-step solution for `movies`, adapting for this extension.  So define one function that processes a movie Element and produces a list of rows.  Then **use** this function in a higher level function that repeats the process for each of the movie Elements in an entire tree, accumulating the complete set of rows, and then creating the data frame.  Call the "higher level" function `genMovieGenres()`, and be sure it returns a pandas data frame.
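One way to realize this two-level design is to have the lower function return one row per (movie, genre) pair, so that the higher function only has to accumulate rows. This is a sketch on invented data; the names `movie_genre_rows` and `movie_genre_table` are hypothetical and differ from the required `genMovieGenres()` on purpose.

```python
import pandas as pd
from lxml import etree

# Invented sample: one movie with two genres, one with a single genre.
SAMPLE = b"""<movies>
  <movie id="1" title="First"><genre>Action</genre><genre>Comedy</genre></movie>
  <movie id="2" title="Second"><genre>Drama</genre></movie>
</movies>"""

def movie_genre_rows(movie_element):
    """Return a list of rows, one per genre of a single movie."""
    movie_id = movie_element.get("id")
    return [{"movieid": movie_id, "genre": genre.text}
            for genre in movie_element.findall("genre")]

def movie_genre_table(root):
    """Accumulate the rows of every movie and build the data frame."""
    rows = []
    for movie in root.xpath("/movies/movie"):
        rows.extend(movie_genre_rows(movie))
    return pd.DataFrame(rows, columns=["movieid", "genre"])

table = movie_genre_table(etree.fromstring(SAMPLE))
print(table)
```

Because each row already holds exactly one genre, the resulting frame has a plain 0, 1, 2, ... index with no repeated labels.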

**Step 1 and 2**

In [20]:
def genGenres(movieElement):
    '''
    This function collects the genres of a movie in the
    xml file "movies.xml".
    
    Parameters: movieElement: the movie element that will have
                its genres collected.
                
    Return: L: a two-item list holding the movie id and the
            list of that movie's genres.
    '''
    L = []
    Gen = []
    L.append(movieElement.attrib['id'])
    for child in movieElement.findall('genre'):
        Gen.append(child.text)
    L.append(Gen)
    return L

def genMovieGenres(movies_root):
    '''
    This function creates a DataFrame of the movie ids and genres
    of the movie in the "movies.xml" file.
    
    Parameters: movies_root: the root of the xml file to get the
                movie genres from.
                
    Return: DataFrame: a DataFrame of the movie ids and genres of
            the movies in the xml file.
    '''
    collist = movies_root.xpath("""/movies/movie""")
    L = []
    for item in collist:
        L.append(genGenres(item))
    table = pd.DataFrame(L, columns = ['movieid', 'genre'])
    table = table.explode('genre')
    return table

movie_genres = genMovieGenres(mroot)
movie_genres.head(15)

Unnamed: 0,movieid,genre
0,527774,Animation
0,527774,Adventure
0,527774,Fantasy
0,527774,Family
0,527774,Action
1,587807,Action
1,587807,Comedy
1,587807,Family
1,587807,Animation
1,587807,Adventure


In [21]:
assert True

#### Bringing it Together with Output

We still have to take the step of creating the CSV files for our tidy data tables.  In addition, we need to avoid the bad and error prone practice of putting code at the global level (outside of functions). We should have a function that brings together the steps of:

1. Constructing the movies tree from the XML file in a data directory,
2. Using the root of that tree to invoke our `genMoviesTable()` and get a data frame,
3. Using the root of that tree to invoke our `genMovieGenres()` function and get another data frame,
4. Writing the output from each of the two tables to CSV.  You should do the "right thing" with respect to index on the data frame.  If your index is the default integer index, you do not want that index to be a column in your CSV.  If your index is meaningful and carries data, then you do want that index to be a column in your CSV.
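The index rule in the last step can be seen on a two-row frame. This is a small sketch with invented data; `DataFrame.to_csv()` returns the CSV text when no path is given, which makes the difference easy to inspect.

```python
import pandas as pd

movies = pd.DataFrame({"movieid": [1, 2], "title": ["First", "Second"]})

# Default integer index written by accident: an unnamed extra column.
csv_default = movies.to_csv()

# Default integer index carries no data, so leave it out.
csv_without_index = movies.to_csv(index=False)

# Here the index *is* data (the movie id), so it should be written.
indexed = movies.set_index("movieid")
csv_with_index = indexed.to_csv()

print(csv_default)
print(csv_without_index)
print(csv_with_index)
```

The stray leading comma in the header of `csv_default` is what later shows up as an `Unnamed: 0` column when such a file is read back into pandas.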

Write the function:

    processMovies(datadir, infilename, 
                  moviesfile, moviegenrefile)
    
that uses `infilename` in `datadir` as the assumed movies XML file, and performs the steps above, putting the CSV of the movies table in `moviesfile` and the CSV of the movie genre table in `moviegenrefile`.  You can assume that output should be to the current directory.

The function should not rely on **any** global variables.  It has no value to return, but the docstring should be explicit on its input and output.

In [22]:
# Solution cell

def processMovies(datadir, infilename, moviesfile, moviegenrefile):
    '''
    This function makes csv files from an xml file that contains
    movie information. One csv file has the movie titles and
    information other than the genres, the other has the genres
    and movie ids.
    
    Parameters: datadir: the directory of the xml file and where
                the csv files will be saved.
                infilename: the name of the xml file with the movie
                information.
                moviesfile: the name of the csv file with the movie
                information other than the genres.
                moviegenrefile: the name of the csv file with the 
                movie genres and ids.
                
    Return: None
    '''
    # Parse with blank text removed, as in the earlier exploration
    parser0 = etree.XMLParser(remove_blank_text=True)
    mroot = movieDataXML(infilename, parser0)
    movies = genMoviesTable(mroot)
    genre = genMovieGenres(mroot)
    movies.to_csv(path_or_buf=os.path.join(datadir, moviesfile),index=False)
    genre.to_csv(path_or_buf=os.path.join(datadir, moviegenrefile),index=False)
    

# Invoking our top level function

datadir = util.resolve_dir("moviedata")
assert datadir is not None

processMovies(datadir, "movies.xml",
              "movies.csv", "moviegenre.csv")

In [23]:
assert True

## Actors and Actor-Movie Development

Our desired analysis of movies may well extend beyond the information entailed by the `movies.xml` file and the derivative tables of the movies and the genres of movies.  Often, we are interested in the actresses and actors that appear in these movies.  A given actress/actor may appear in multiple of the movies in our data set.  For each movie the actress/actor appears in, there is the dependent information of the character that they portray. 

Further, the actor has their own dependent information, including their name, their birth date, their place of birth, and their popularity.

The structure of processing the XML for actors and their movies follows the same basic structure as we did for movies.  This time, you are not given prescribed functions, but you **are** expected to continue designing and then using functions to build and then output the tidy tables inherent in this second data set.

One wrinkle in this processing is that the actor-movie data is not contained in a single XML file but, rather, it is spread over **ten** source files.  Having the data of interest spread over multiple files or "chunks" is a common situation in data systems.  We must be able to process each file to gather and organize its data, but we must also be able to build up aggregate data out of these multiple files or chunks.  

Looking at the data in the `moviedata` directory, we can see that the names of these files share a common prefix, and then use an integer value following the prefix (and before the extension) to distinguish the different files.  
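Given that naming pattern, the file names can be generated rather than typed out. The sketch below assumes the numbering starts at 0, as the file name `actormovie0.xml` suggests; the function name `chunk_paths` and its parameters are hypothetical.

```python
import os

def chunk_paths(datadir, prefix, count, extension=".xml"):
    """
    Build the paths of `count` files named <prefix><i><extension>
    inside `datadir`, with i running from 0 to count - 1.
    """
    return [os.path.join(datadir, f"{prefix}{i}{extension}")
            for i in range(count)]

paths = chunk_paths("moviedata", "actormovie", 3)
print(paths)
```

Taking the prefix and the count as parameters keeps the function usable if the number of chunks changes.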

### Acquiring one of the XML Files & Understanding the Data

**Step 1** From your work on the movies XML and table, you should already have available a function to read and parse an XML.  Use this function to get the root of the tree for `actormovie0.xml` and assign it to `am0root` (for actor-movie 0).

In [24]:
am0root = movieDataXML("actormovie0.xml", parser0)
am0root

<Element results at 0x7fa757c51e00>

In [25]:
assert True

**Step 2** Use the following code cell to gather information from the XML to help start understanding the data.  You are welcome to use as many code cells as you wish to demonstrate this discovery.

These steps could use XML procedural operations and XPath to find results.  Some exploratory steps and questions might include:

- Printing a prefix of the tree to see the top level structure,
- Finding the number of actor-movie combinations in this XML file/tree.
- Determining the attributes and the set of children of an actormovie Element,
    - do all actormovie elements have all of the same attributes and/or children?  Be careful here, some children are **not** present in all of these elements.
    - other than through omission of a child, are there any other children whose text value is empty?
- Are there aspects that are problematic when we think about tidy data?
- Can you use your `childStats()` to get a sense of the `popularity`  child?  Why or why not?  If not, can you change the function to allow it to be applicable in this case too?

You may use the above list, but also come up with your own.  Then, in the markdown cell following the code cell, describe what you learned about the data.  You should write down your functional dependency(ies) entailed by the data.
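If a child can be absent or have empty text, a statistics function has to skip those cases before converting. The sketch below shows one tolerant variant on invented data; whether the real files need it is something to verify during exploration. The name `tolerant_stats` is hypothetical, and it returns (min, mean, max).

```python
from lxml import etree

# Invented sample: one empty popularity and one missing entirely.
SAMPLE = b"""<results><actormovies>
  <actormovie actor_id="1"><popularity>2.0</popularity></actormovie>
  <actormovie actor_id="2"><popularity/></actormovie>
  <actormovie actor_id="3"></actormovie>
  <actormovie actor_id="4"><popularity>4.0</popularity></actormovie>
</actormovies></results>"""

def tolerant_stats(root, xpath_expr, conversion=float):
    """
    Return (min, mean, max) over the elements selected by
    `xpath_expr`, ignoring elements whose text is empty.
    Return None when no usable value is found.
    """
    values = []
    for element in root.xpath(xpath_expr):
        text = element.text
        if text is None or not text.strip():
            continue  # empty element: nothing to convert
        values.append(conversion(text))
    if not values:
        return None
    return min(values), sum(values) / len(values), max(values)

root = etree.fromstring(SAMPLE)
stats = tolerant_stats(root, "/results/actormovies/actormovie/popularity")
missing = tolerant_stats(root, "/results/actormovies/actormovie/deathday")
print(stats, missing)
```

Selecting the elements rather than their `text()` nodes is what lets the function notice an element that exists but is empty.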

In [26]:
prefix = am0root.xpath('/results')
prefix

[<Element results at 0x7fa757c51e00>]

In [27]:
actorsmov = len(am0root.xpath('/results/actormovies/actormovie'))
actorsmov

40

In [28]:
attrib = am0root.xpath('/results/actormovies/actormovie[1]/@*')
attrib

['1663195', '527774']

In [29]:
popularity = childStats(am0root, "/results/actormovies/actormovie/popularity/text()")
popularity

(1.611, 52.843999999999994, 11.533274999999996)

The biggest problem with this data is that some people have death dates while others do not. There are also two functional dependencies, one whose independent variable is the actor id and one whose independent variable is the movie id.

### Building Tabular Data from ActorMovies XML

Hopefully, you have recognized more than one functional dependency entailed in the data.  If you have not, then go back and look more carefully at the data and what constitutes independent and dependent variables.  Refine your FDs in the last question accordingly.
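A candidate functional dependency can be checked against a table: the left-hand side determines the right-hand side when every combination of left-hand values maps to a single right-hand value. The sketch below uses an invented frame; the column names (`movie_id`, `character`, and so on) and the helper `determines` are assumptions for illustration.

```python
import pandas as pd

# Invented rows: actor 1 appears in two movies as different characters.
df = pd.DataFrame({
    "actor_id":  [1, 1, 2],
    "movie_id":  [10, 11, 10],
    "name":      ["A", "A", "B"],
    "character": ["x", "y", "z"],
})

def determines(frame, lhs, rhs):
    """
    Return True when each combination of values in the `lhs`
    columns maps to at most one value in every `rhs` column.
    """
    counts = frame.groupby(lhs, dropna=False)[rhs].nunique(dropna=False)
    return bool((counts <= 1).all().all())

print(determines(df, ["actor_id"], ["name"]))
print(determines(df, ["actor_id"], ["character"]))
print(determines(df, ["actor_id", "movie_id"], ["character"]))
```

A check like this can only refute a dependency on the data at hand; passing it does not prove the dependency holds in general, so it complements reasoning about what the variables mean.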

When we processed the movies XML, we were able to omit the genres children of a movie and then independently build two tables that, once built, were in a tidy form.

But sometimes, making data tidy is more easily accomplished using the power of pandas and the techniques we learned in the tabular unit.  In this case, an alternative strategy would be to build a table from the data, but with the understanding that it is **not** tidy.  But we then use pandas operations to make it tidy afterward.  That is the strategy we will employ here.

While we will give you code cells (below) for you to use for the steps as we describe them here, we leave it to you to define the names of the functions and the parameters for the functions.  You can also choose a different solution strategy.  But you **will** be assessed on your functional design and you should be able to defend an alternate design strategy.

#### ActorMovie Table (non-Tidy version)

As noted above, we have multiple files that contain actor-movie XML.  When we adapt what we learned in processing movies, we can define the following multistep process:

1. Determine the aggregate set of columns and define a function for generating and returning a table row that is based on the children of an `actormovie` Element.  We call this `genActorMovieRow()`.
2. Build a higher level function that, given the root of an XML tree from a single file, uses `genActorMovieRow()` repeatedly on the `actormovie` Elements in the tree, and returns a DataFrame.  We call this `genActorMoviesTable()`
3. Our top level function (and the analog of `processMovies()` from the prior development) has some additional complexity.  Given a data directory and a **filename prefix** and a **number of files**, the function can, repeatedly,
    - generate a file name for one of the files,
    - read and parse the XML into a tree,
    - generate a data frame using `genActorMoviesTable()`,
    - accumulate this data frame into an aggregate by concatenating in the row dimension.
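
The three-level process above can be sketched as follows.  This is only an illustration under stated assumptions: it uses the standard library's `xml.etree.ElementTree` rather than `lxml` so that it is self-contained, the function names (`gen_row`, `gen_table`, `process_actor_movies`) are hypothetical, the file naming scheme is assumed to be the prefix followed by an index and `.xml`, and only two child fields are collected.  The tiny XML files are written by the example itself, using values from the outputs in this section.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

import pandas as pd

FIELDS = ["name", "character"]  # small subset of children, for illustration only

def gen_row(element):
    """Build one row (a dict) from a single actormovie element."""
    row = {"actor_id": element.get("actor_id"), "movie_id": element.get("movie_id")}
    for field in FIELDS:
        row[field] = element.findtext(field)  # None when the child is absent
    return row

def gen_table(root):
    """Build a DataFrame with one row per actormovie element under root."""
    return pd.DataFrame([gen_row(e) for e in root.iter("actormovie")])

def process_actor_movies(datadir, prefix, num_files):
    """Parse <prefix>0.xml .. <prefix>(num_files-1).xml and stack the tables."""
    frames = []
    for i in range(num_files):
        path = os.path.join(datadir, f"{prefix}{i}.xml")
        root = ET.parse(path).getroot()
        frames.append(gen_table(root))
    # Concatenate once, after the loop, along the row dimension.
    return pd.concat(frames, ignore_index=True)

# Usage: write two small files to a temporary directory, then process them.
template = (
    "<results><actormovies>"
    '<actormovie actor_id="{a}" movie_id="{m}">'
    "<name>{n}</name><character>{c}</character>"
    "</actormovie></actormovies></results>"
)
samples = [
    ("1663195", "527774", "Kelly Marie Tran", "Raya (voice)"),
    ("1625558", "527774", "Awkwafina", "Sisu (voice)"),
]
with tempfile.TemporaryDirectory() as tmp:
    for i, (a, m, n, c) in enumerate(samples):
        with open(os.path.join(tmp, f"actormovie{i}.xml"), "w", encoding="utf-8") as fh:
            fh.write(template.format(a=a, m=m, n=n, c=c))
    combined = process_actor_movies(tmp, "actormovie", len(samples))

print(combined)
```

Collecting the per-file frames in a list and calling `pd.concat` once avoids rebuilding the aggregate on every iteration.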

#### Separate Actor and ActorMovie Tables (Tidy versions)

If the top level function described in (3) above returns the resultant data frame, we can complete the work by writing a function that takes that non-tidy data frame as a parameter and then:

1. Splits the non-tidy data frame into two parts,
2. Cleans (by removing duplicates) one of the resultant tables,
3. Writes the two data frames to CSV.

This function should have parameters appropriate to its operation.
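
As a rough sketch of these three steps, assuming the non-tidy frame carries both actor-level columns and actor-movie-level columns: the function name `split_and_save`, its parameters, and the choice of columns are hypothetical, and the sample rows are copied from the outputs in this section.

```python
import os
import tempfile

import pandas as pd

ACTOR_COLS = ["actor_id", "name"]                   # depend on actor_id alone
ROLE_COLS = ["actor_id", "movie_id", "character"]   # depend on the (actor, movie) pair

def split_and_save(nontidy, datadir, actors_file, roles_file):
    """Split a non-tidy actor-movie frame into two tidy frames and write both as CSV."""
    # 1. Project the columns that belong to each table.
    actors = nontidy[ACTOR_COLS]
    roles = nontidy[ROLE_COLS]
    # 2. An actor appears once per movie, so the actor projection has repeats.
    actors = actors.drop_duplicates().reset_index(drop=True)
    # 3. Write both tables without the pandas index.
    actors.to_csv(os.path.join(datadir, actors_file), index=False)
    roles.to_csv(os.path.join(datadir, roles_file), index=False)
    return actors, roles

nontidy = pd.DataFrame({
    "actor_id":  ["1663195", "1663195", "1625558", "1625558"],
    "movie_id":  ["527774", "529203", "527774", "512200"],
    "name":      ["Kelly Marie Tran", "Kelly Marie Tran", "Awkwafina", "Awkwafina"],
    "character": ["Raya (voice)", "Dawn Betterman (voice)", "Sisu (voice)", "Ming"],
})

with tempfile.TemporaryDirectory() as tmp:
    actors, roles = split_and_save(nontidy, tmp, "actors.csv", "roles.csv")
    written = sorted(os.listdir(tmp))

print(actors)
print(roles)
```

Only the actor table needs de-duplication here, because each (actor, movie) pair already occurs once in the source rows.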

One could also envision a solution which combines these last steps into the top level function from the prior part.

--------------

##### Code for Generating a Single ActorMovie Row

> You should include code to test your function on some select actor-movie Elements.

In [30]:
def genActorMovieRow(actorElement):
    '''
    This function generates an actor-movie row from
    an actor element of an xml file.
    
    Parameters: actorElement: an actor element from an
                xml file
    
    Return: D: a dictionary with the row for an actor
    '''
    D = {'actor_id':actorElement.attrib['actor_id']}
    for child in ['name','birthday','deathday', 'place_of_birth', 'popularity']:
        try:
            value = actorElement.find(child).text
        except AttributeError:
            # find() returns None when the child is missing
            value = None
        D[child] = value
    return D
    
firstactor = am0root.xpath("//results/actormovies/actormovie[1]")[0]
row = genActorMovieRow(firstactor)
print(row)

{'actor_id': '1663195', 'name': 'Kelly Marie Tran', 'birthday': '1989-01-17', 'deathday': None, 'place_of_birth': 'San Diego, California, USA', 'popularity': '11.484000000000002'}
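
Beyond the first element, it can be worth testing an element that lacks some of its children.  The sketch below is one possible way to do that, using the standard library's `xml.etree.ElementTree` and a hand-written element; `gen_actor_row` is a hypothetical stand-in for `genActorMovieRow()` limited to three children, and the XML string is an assumption about the file layout, with values copied from the output above.

```python
import xml.etree.ElementTree as ET

def gen_actor_row(element, fields=("name", "birthday", "deathday")):
    """Hypothetical stand-in for genActorMovieRow(), limited to three children."""
    row = {"actor_id": element.get("actor_id")}
    for field in fields:
        child = element.find(field)
        # A missing child is recorded as None rather than raising an error.
        row[field] = child.text if child is not None else None
    return row

# Hand-written element with no deathday child.
xml_text = (
    '<actormovie actor_id="1663195" movie_id="527774">'
    "<name>Kelly Marie Tran</name><birthday>1989-01-17</birthday>"
    "</actormovie>"
)
row = gen_actor_row(ET.fromstring(xml_text))
print(row)
```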


-----

##### Code for Processing an Actor-Movie Tree

> You should include code to test your function on one or more actor-movie trees.

In [31]:
def genActorMoviesTable(actor_root):
    '''
    This function generates a table of movie actors from
    the root of an xml tree.
    
    Parameters: actor_root: the root of the xml tree that
                will have its actors put into a table
                
    Return: table: a table of the actors in the xml file
    '''
    collist = actor_root.xpath('/results/actormovies/actormovie')
    L = []
    for actor in collist:
        L.append(genActorMovieRow(actor))
    table = pd.DataFrame(L)
    return table

actors = genActorMoviesTable(am0root)
actors.head(10)

Unnamed: 0,actor_id,name,birthday,deathday,place_of_birth,popularity
0,1663195,Kelly Marie Tran,1989-01-17,,"San Diego, California, USA",11.484000000000002
1,1663195,Kelly Marie Tran,1989-01-17,,"San Diego, California, USA",11.484000000000002
2,1625558,Awkwafina,1989-06-02,,"New York City, New York, USA",12.714
3,1625558,Awkwafina,1989-06-02,,"New York City, New York, USA",12.714
4,2362044,Izaac Wang,2007-10-22,,"Minnesota, USA",8.765
5,97576,Gemma Chan,1982-11-29,,"London, England, UK",7.375
6,18307,Daniel Dae Kim,1968-08-09,,"Busan, South Korea",9.757
7,30082,Benedict Wong,1971-06-03,,"Eccles, Greater Manchester, England, U.K",8.185
8,13620,William Hanna,1910-07-14,2001-03-22,"Melrose, New Mexico, USA",7.737999999999999
9,33923,Mel Blanc,1908-05-30,1989-07-10,"San Francisco, California, USA",2.531


----

##### Code for Processing a Set of Actor-Movies XML Files

> Show some testing of your result

In [32]:
def multActorsTable(prefix, num):
    '''
    This function will generate one actor table for 
    multiple actor files.
    
    Parameters: prefix: the prefix of the xml file before
                the file number
                num: the number of xml actor files
                
    Return: table: a table of all the actors in the different
            xml files (without repeats)
    '''
    # Parse each file; movieDataXML() and parser0 are defined earlier in the notebook.
    xmlList = []
    for i in range(num):
        filename = prefix + str(i) + '.xml'
        xmlList.append(movieDataXML(filename, parser0))

    dataTable = []
    for item in xmlList:
        dataTable.append(genActorMoviesTable(item))
    # Concatenate once, after all per-file tables have been built.
    table = pd.concat(dataTable)
    table = table.drop_duplicates()
    return table.reset_index(drop=True)

actortable = multActorsTable("actormovie", 10)
actortable

Unnamed: 0,actor_id,name,birthday,deathday,place_of_birth,popularity
0,1663195,Kelly Marie Tran,1989-01-17,,"San Diego, California, USA",11.484000000000002
1,1625558,Awkwafina,1989-06-02,,"New York City, New York, USA",12.714
2,2362044,Izaac Wang,2007-10-22,,"Minnesota, USA",8.765
3,97576,Gemma Chan,1982-11-29,,"London, England, UK",7.375
4,18307,Daniel Dae Kim,1968-08-09,,"Busan, South Korea",9.757
...,...,...,...,...,...,...
350,175913,Brittany Drisdelle,1988-02-06,,"Hudson, Québec, Canada",1.903
351,1496335,Madeline Harvey,,,,1.38
352,2092553,Paul Zinno,,,,0.6
353,1818534,Nick Walker,,,,0.6


In [33]:
def genCharacterRow(actorElement):
    '''
    This function generates a row of characters
    from an actor element.
    
    Parameters: actorElement: an xml element of an
                actor
    
    Return: D: the dictionary of the character row
    '''
    D = {'actor_id': actorElement.attrib['actor_id'],
         'movie_id': actorElement.attrib['movie_id'],
         'character': actorElement.find('character').text}
    return D
    
def genCharacterTable(actor_root):
    '''
    This function generates a character table from
    a single actor xml file.
    
    Parameters: actor_root: the root of the actor
                xml file
                
    Return: table: the table of the characters in the
            actors xml file
    '''
    collist = actor_root.xpath('/results/actormovies/actormovie')
    L = []
    for actor in collist:
        L.append(genCharacterRow(actor))
    table = pd.DataFrame(L)
    return table

def multCharacterTable(prefix, num):
    '''
    This function creates one character table for
    multiple actor xml files.
    
    Parameters: prefix: the prefix of the actor xml
                files before the file number
                num: the number of xml files
                
    Return: table: the table with all the character 
            information from the xml files
    '''
    xmlList = []
    for i in range(num):
        filename = prefix + str(i) + '.xml'
        xmlList.append(movieDataXML(filename, parser0))

    dataTable = []
    for item in xmlList:
        dataTable.append(genCharacterTable(item))
    # Concatenate once, after all per-file tables have been built.
    table = pd.concat(dataTable)
    table = table.drop_duplicates()
    return table.reset_index(drop=True)

chartable = multCharacterTable("actormovie", 10)
chartable

Unnamed: 0,actor_id,movie_id,character
0,1663195,527774,Raya (voice)
1,1663195,529203,Dawn Betterman (voice)
2,1625558,527774,Sisu (voice)
3,1625558,512200,Ming
4,2362044,527774,Boun (voice)
...,...,...,...
379,175913,607383,Priscilla
380,1496335,607383,Alice
381,2092553,607383,Tommy
382,1818534,607383,Paul Wilkinson


------

##### Code for Final Tidying and Output

In [34]:
def processActors(datadir, prefix, num, actorsfile, charactersfile):
    '''
    This function creates two csv files of actor data
    and character data from xml files.
    
    Parameters: datadir: the directory that holds the xml files
                and where the csv files will be saved
                prefix: the prefix of the actor xml files
                num: the number of xml files that will be
                processed
                actorsfile: the name of the actors csv file
                charactersfile: the name of the characters
                csv file
                
    Return: None
    '''
    # The helpers below use the notebook-level parser0, so no local
    # parser needs to be created here.
    actors = multActorsTable(prefix, num)
    characters = multCharacterTable(prefix, num)
    actors.to_csv(path_or_buf=os.path.join(datadir, actorsfile),index=False)
    characters.to_csv(path_or_buf=os.path.join(datadir, charactersfile),index=False)

processActors(datadir, "actormovie", 10, "actors.csv", "characters.csv")