## Tabular Data

### Tidy Data from Excel

An Excel spreadsheet with some brief information on awards given to movies is available at:

> https://www.gnosis.cx/cleaning/Film_Awards.xlsx

In a more fleshed out case, we might have data for many more years, more types of awards, more associations that grant awards, and so on.  While the organization of this spreadsheet is much like a great many you will encounter "in the wild," it is very little like the tidy data we would rather work with.  In the simple example, only 63 data values occur, and you could probably enter them into the desired structure by hand as quickly as coding the transformations.  However, the point of this exercise is to write programming code that could generalize to larger data sets of similar structure.

<img src="img/Film_Awards.png" alt="Film Awards"/>

__Image: Film Awards Spreadsheet__

Your task in this exercise is to read this data into a single well-normalized data frame, using whichever language and library you are most comfortable with.  Along the way, you will need to remediate whatever data integrity problems you detect.  As examples of issues to look out for:

* The film _1917_ was stored as a number not a string when naïvely entered into a cell.
* The spelling of some values is inconsistent.  Olivia Colman's name is incorrectly transcribed as 'Coleman' in one occurrence.  There is a spacing issue in one value you will need to identify.
* Structurally, an apparent parallel is not really so.  Person names are sometimes listed under the name of the association, but elsewhere under another column.  Film names are sometimes listed under association, other times elsewhere.
* Some column names occur multiple times in the same tabular area.

In thinking about a good data frame organization, think of what the independent and dependent variables are.  In each year, each association grants an award in each category; these are independent dimensions.  A person name and a film name are slightly tricky since they are not exactly independent, but at the same time some awards are to a film and others to a person.  Moreover, one actor might appear in multiple films in a year (not in this sample data, but do not rule it out).  Likewise, multiple films have at times shared the same name over the course of film history.  Some persons are both director and actor (in either the same or different films).

Once you have a useful data frame, use it to answer these questions in summary reports:

* For each film involved in multiple awards, list the award and year it is associated with.
* For each actor/actress winning multiple awards, list the film and award they are associated with.
* While not occurring in this small data set, sometimes actors/actresses win awards for multiple films (usually in different years).  Make sure your code will handle that situation.
* It is manual work, but you may want to research and add awards given in other years; in particular, adding some data will show actors with awards for multiple films.  Do your other reports correctly summarize the larger data set?
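
As a rough sketch of what these reports might look like once the data is tidy, the following assumes one row per (year, association, category) award, with `film` and `person` columns.  The column names, the helper `multiple_winners()`, and all the rows are hypothetical placeholders rather than values from the spreadsheet.

```python
import pandas as pd

# Hypothetical tidy rows: one row per (year, association, category) award.
# Film and person names are placeholders, not values from the spreadsheet.
awards = pd.DataFrame(
    [
        (2019, "Assoc A", "Best Picture", "Film X", None),
        (2019, "Assoc B", "Best Picture", "Film X", None),
        (2019, "Assoc A", "Best Actor", "Film Y", "Person P"),
        (2018, "Assoc B", "Best Actor", "Film Z", "Person P"),
        (2018, "Assoc A", "Best Picture", "Film W", None),
    ],
    columns=["year", "association", "category", "film", "person"],
)

def multiple_winners(df, key):
    """Return the rows whose `key` value (film or person) has several awards."""
    rows = df.dropna(subset=[key])
    counts = rows.groupby(key)[key].transform("count")
    return rows[counts > 1].sort_values([key, "year"])

films = multiple_winners(awards, "film")      # both "Film X" rows
people = multiple_winners(awards, "person")   # "Person P" for two films
```

Grouping on the film name alone would conflate distinct films that share a title; a fuller version would probably need to group on something like (film, release year) instead.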

### Tidy Data from SQL

An SQLite database with roughly the same brief information as in the prior spreadsheet is available at:

> https://www.gnosis.cx/cleaning/Film_Awards.sqlite

However, the information in the database version is relatively well normalized and typed.  Also, additional information has been included on a variety of entities included in the spreadsheet.  Only slightly more information is included in this schema than in the spreadsheet, but it should be able to accommodate a large amount of data on films, actors, directors, and awards, and the relationships among those data.

```sql
sqlite> .tables
actor     award     director  org_name
```

As was mentioned in the prior exercise, the same name for a film can be used more than once, even by the same director.  For example, Abel Gance used the title _J'accuse!_ for both his 1919 and 1938 films with connected subject matter.

```sql
sqlite> SELECT * FROM director WHERE year < 1950;
Abel Gance|J'accuse!|1919
Abel Gance|J'accuse!|1938
```

Let us look at a selection from the `actor` table, for example.  In this table we have a column `gender` to differentiate beyond name. As of this writing, no transgender actor has been nominated for a major award both before and after a change in gender identity, but this schema allows for that possibility.  In any case, we can use this field to differentiate the "actor" versus "actress" awards that many organizations grant.

```sql
sqlite> .schema actor
CREATE TABLE actor (name TEXT, film TEXT, year INTEGER, gender CHAR(1));

sqlite> SELECT * FROM actor WHERE name="Joaquin Phoenix";
Joaquin Phoenix|Joker|2019|M
Joaquin Phoenix|Walk the Line|2006|M
Joaquin Phoenix|Hotel Rwanda|2004|M
Joaquin Phoenix|Her|2013|M
Joaquin Phoenix|The Master|2013|M
```

The goal in this exercise is to create the same tidy data frame that you created in the prior exercise, and answer the same questions that were asked there.  If some questions can be answered directly with SQL, feel free to take that approach instead.  For this exercise, only consider awards for the years 2017, 2018, and 2019.  Some others are included in an incomplete way, but your reports are for those years.

```sql
sqlite> SELECT * FROM award WHERE winner="Frances McDormand";
Oscar|Best Actress|2017|Frances McDormand
GG|Actress/Drama|2017|Frances McDormand
Oscar|Best Actress|1997|Frances McDormand
```
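
Some of the reports can plausibly be produced in SQL alone.  The sketch below builds a tiny in-memory stand-in for the `award` table from the three rows shown above, then counts winners with more than one award in the years of interest.  Only the `winner` column name is confirmed by the query above; `org`, `category`, and `year` are assumed names for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Column names other than `winner` are assumptions made for this sketch.
con.execute(
    "CREATE TABLE award (org TEXT, category TEXT, year INTEGER, winner TEXT)"
)
con.executemany(
    "INSERT INTO award VALUES (?, ?, ?, ?)",
    [
        ("Oscar", "Best Actress", 2017, "Frances McDormand"),
        ("GG", "Actress/Drama", 2017, "Frances McDormand"),
        ("Oscar", "Best Actress", 1997, "Frances McDormand"),
    ],
)

# Restrict to the years the exercise asks about, then keep repeat winners.
query = """
    SELECT winner, COUNT(*) AS n_awards
    FROM award
    WHERE year BETWEEN 2017 AND 2019
    GROUP BY winner
    HAVING COUNT(*) > 1
"""
multi = con.execute(query).fetchall()
```

Against the downloaded database you would connect to the file path instead of `":memory:"` and skip the table creation.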

## Hierarchical Data

The two exercises here deal with hierarchical data in different ways.  The first refines the processing of the geographic data that is available in several formats.  The second addresses moving between a key/value and a relational model for data representation.

### Exploring Filled Area

Using the United States county data we created tidy data frames that contained the extents of counties as simple cardinal direction limits; we also were provided with the "census area" of each county.  Unfortunately, the data available here does not specifically address water bodies and their sizes, which might occur within counties.

The census data can be found at:

> https://www.gnosis.cx/cleaning/gz_2010_us_050_00_20m.json

> https://www.gnosis.cx/cleaning/gz_2010_us_050_00_20m.kml

> https://www.gnosis.cx/cleaning/gz_2010_us_050_00_20m.zip


In this exercise you will create an additional column in the data frame illustrated in the text to hold the percentage of the "bounding box" of a county that is occupied by the census area.  The trick, of course, is that the surface area enclosed by latitude/longitude corners is not a simple rectangle, nor even a trapezoid, but rather a portion of a spherical surface.  County shapes themselves are typically not rectangular, and may include discontiguous regions.

To complete this exercise, you may either reason mathematically about this area (the simplifying assumption that the Earth is a sphere is acceptable) or identify appropriate GIS software to do this calculation for you.  The result of your work will be a data frame like that presented in the chapter, but with a column called `"occupied"` that contains 3221 floating point values between 0 and 1.
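
If you take the mathematical route, one way to reason about it is that on a sphere of radius $R$, the region between two latitudes and two longitudes has area $R^2\,\Delta\lambda\,(\sin\varphi_{north} - \sin\varphi_{south})$, with $\Delta\lambda$ in radians.  The sketch below applies that formula; the function names and the choice of kilometers are assumptions, and a bounding box that crosses the antimeridian would need special handling that is not shown.

```python
from math import pi, radians, sin

EARTH_RADIUS_KM = 6371.0  # mean radius, under the spherical-Earth simplification

def bbox_area(south, north, west, east, radius=EARTH_RADIUS_KM):
    """Area of a latitude/longitude 'rectangle' on a sphere (degrees in)."""
    dlon = radians(east - west)
    return radius**2 * dlon * (sin(radians(north)) - sin(radians(south)))

def occupied(census_area, south, north, west, east, radius=EARTH_RADIUS_KM):
    """Fraction of the bounding box covered by the census area.

    `census_area` must be in the square of whatever unit `radius` uses.
    """
    return census_area / bbox_area(south, north, west, east, radius)

# Sanity check: the box covering the whole globe is the full sphere.
whole_sphere = bbox_area(-90, 90, -180, 180)
```

Check which unit the census area in the data files uses before dividing; a mismatch with the radius unit silently scales every result by the same wrong factor.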

For extra credit you can investigate or improve a few additional data integrity issues.  The Shapefile in the zip archive is the canonical data provided by the US Census Bureau.  The code we saw in this chapter to process GeoJSON and KML actually produces slightly different results for latitude/longitude locations, at the third decimal place.  Presumably, the independent developer from whom I downloaded these conversions allowed some data error to creep in somehow.  Diagnose which version, if either, matches the original `.shp` file, and try to characterize the reason for and degree of the discrepancy.

For additional extra credit, fix the `kml_county_summary()` function presented in this chapter so that it correctly handles `<MultiGeometry>` county shapes rather than skipping over them.  How often did this problem occur among the 3221 United States counties?

### Create a Relational Model

The key/value data in the DBM restaurant data is organized in a manner that might provide very fast access in Redis or similar systems.  But there is certainly a mismatch with the implicit data model.  Keys have structure in their hierarchy, but it is a finite and shallow hierarchy.  Values may be of several different implicit data types; in particular, ratings are stored as strings, but they really represent sequences of small integer values.  Other fields are simple strings (albeit stored as bytes in the DBM).

The `dbm` module in the shown example uses Python's fallback "dumb DBM" format, which does not depend on external drivers like GDBM or LDBM.  For the example with hundreds of records this is quite fast; if you wished to use millions of records, other systems would scale better and are preferred.  This "dumb" format actually consists of three separate files that share the `keyval.db` prefix; the three are provided as a zip archive.

```python
import dbm

dbm.whichdb('data/keyval.db')
```

The "dbm.dumb" format is not necessarily portable to other programming languages.  It is, however, simple enough that you could write an adapter rather easily.  To provide the identical data in a more universal format, a CSV of the identical content is also available:

> https://www.gnosis.cx/cleaning/keyval.zip

> https://www.gnosis.cx/cleaning/keyval.csv

For this assignment you should transform the key/value data in this example into relational tables, using foreign keys where appropriate, and making good decisions about data types.  SQLite is an excellent choice for a database system to target; it is discussed in chapter 1 (*Data Ingestion – Tabular Formats*).  Any other RDBMS is also a good choice if you have administrative access (i.e. table creation rights).  Before transforming the data model, you will need to clean up the inconsistencies in the hierarchical keys that were discussed in this chapter.

The names of restaurants are promised to be distinct; however, for foreign key relationships, you may wish to normalize using a short index number standing for the restaurants uniformly.  The separate ratings should definitely be stored as distinct data items in a relevant table.  To get a feel for more fleshed out data, invent timestamps for the reviews, such that each is mostly distinct.  A real-world data set will generally contain review dates; for the example no specific dates are required, just the form of them.

Although this data is small enough that performance will not be a concern, think about what indices are likely to be useful in a hypothetical version of this data that is thousands or millions of times larger.  Imagine you are running a popular restaurant review service and you want your users to have fast access to their common queries.

Using the relational version of your data model, answer some simple queries, most likely using SQL.

* What restaurant received the most reviews?
* What restaurants received reviews of 10 during a given time period (the relevant range will depend on which dates you chose to populate)?
* What style of cuisine received the highest mean review?

For extra credit, you may go back and write code to answer the same questions using only the key/value data model.
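
To make the target concrete, here is one possible shape for the relational version, sketched with Python's `sqlite3`.  Every table name, column name, index, and row below is a hypothetical choice for illustration; in particular, the rows are placeholders and are not derived from the DBM data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE restaurant (
        id      INTEGER PRIMARY KEY,       -- short surrogate key
        name    TEXT NOT NULL UNIQUE,      -- names are promised distinct
        cuisine TEXT
    );
    CREATE TABLE review (
        id            INTEGER PRIMARY KEY,
        restaurant_id INTEGER NOT NULL REFERENCES restaurant(id),
        rating        INTEGER NOT NULL,    -- one row per individual rating
        reviewed_at   TEXT NOT NULL        -- invented ISO-8601 timestamp
    );
    -- Indices guessed from the queries listed in the exercise
    CREATE INDEX review_by_restaurant ON review (restaurant_id);
    CREATE INDEX review_by_time ON review (reviewed_at);
""")

# Placeholder rows only
con.executemany(
    "INSERT INTO restaurant (name, cuisine) VALUES (?, ?)",
    [("Placeholder A", "Style 1"), ("Placeholder B", "Style 2")],
)
con.executemany(
    "INSERT INTO review (restaurant_id, rating, reviewed_at) VALUES (?, ?, ?)",
    [
        (1, 10, "2020-01-01T12:00:00"),
        (1, 7, "2020-01-02T09:30:00"),
        (2, 8, "2020-01-03T18:45:00"),
    ],
)

most_reviewed = con.execute("""
    SELECT r.name, COUNT(*) AS n
    FROM review AS v JOIN restaurant AS r ON r.id = v.restaurant_id
    GROUP BY r.id
    ORDER BY n DESC
    LIMIT 1
""").fetchone()
```

Storing timestamps as ISO-8601 text keeps range queries simple in SQLite, since lexical order then matches chronological order; other RDBMSs offer native timestamp types that would be a better fit.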

## Other Data Formats

We present here two exercises.  One of them deals with a custom binary format, the other with web scraping.  Not every topic of this chapter is addressed in the exercises, but these two are important domains for practical data science.

### Enhancing the NPY Parser

The binary data we read from the NPY file was in the simplest format we could choose.  For this exercise you want to process a somewhat more complex binary file using your own code.  Write a custom function that reads a file into a NumPy array, and test it against several arrays you have serialized using `numpy.save()` or `numpy.savez()`.

Test cases for your function are at the URLs:

> https://www.gnosis.cx/cleaning/students.npy

> https://www.gnosis.cx/cleaning/students.npz

We have not previously looked at the NPZ format, but it is a zip archive of one or more NPY files, allowing both compression and storage of multiple arrays.  Ideally your function will handle both formats, and will determine which type of file you are reading based on the magic string in the first few bytes.  As a first pass, only try to parse the NPY version, then enhance from there.
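
As a starting point for the first pass, the sketch below distinguishes the two formats by their leading bytes and reads only the NPY header, not the array data.  It relies on the published NPY layout (six magic bytes, two version bytes, a little-endian header length, then a Python dict literal); the function names `sniff()` and `read_npy_header()` are hypothetical.

```python
import ast
import struct

NPY_MAGIC = b"\x93NUMPY"
ZIP_MAGIC = b"PK\x03\x04"  # an NPZ file is an ordinary zip archive

def sniff(first_bytes):
    """Guess the container type from the first few bytes of a file."""
    if first_bytes.startswith(NPY_MAGIC):
        return "npy"
    if first_bytes.startswith(ZIP_MAGIC):
        return "npz"
    return "unknown"

def read_npy_header(raw):
    """Return (header_dict, data_offset) for raw NPY bytes."""
    if not raw.startswith(NPY_MAGIC):
        raise ValueError("not an NPY file")
    major = raw[6]
    if major == 1:
        (hlen,) = struct.unpack("<H", raw[8:10])   # 2-byte length in v1.0
        start = 10
    else:
        (hlen,) = struct.unpack("<I", raw[8:12])   # 4-byte length in v2.0+
        start = 12
    text = raw[start:start + hlen].decode("utf-8").strip()
    return ast.literal_eval(text), start + hlen
```

For a record array, the `descr` entry of the header is a list of (field name, type code) pairs rather than a single type string, which is where the "think about them carefully" part begins.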

Using the official readers, we can see that this array adds something the earlier example had not.  Specifically, it stores a `recarray` that combines several data types into each value in the array.  The rules we described earlier in this chapter will actually still suffice, but you have to think about them carefully.  The data your reader produces should exactly match what the official reader returns.

```python
import numpy as np

students = np.load(open('data/students.npy', 'rb'))
print(students)
print("\nDtype:", students.dtype)
```

When you move on to processing the NPZ format, you can compare again with the official reader.  As mentioned, this might have several arrays inside it, although only one is stored in the example.

```python
arrs = np.load(open('data/students.npz', 'rb'))
print(arrs)
arrs.files
```

The contents of `arr_0` within the NPZ file are identical to the single array in the NPY.  However, after you have successfully parsed this NPZ file, try creating one or more others that actually do store multiple arrays, and parse those using custom code.  Decide on the best API to use for a function that may need to return either one or several arrays.  For this part of the task, the Python standard library module `zipfile` will be very helpful for you.
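
One candidate API, offered only as a sketch, is to always return a mapping from array name to content, so that callers handle one array and many arrays the same way.  The helper `npz_members()` below is hypothetical; it stops at the raw NPY bytes of each member, leaving the actual parsing to your own reader, and the archive it is demonstrated on holds placeholder payloads rather than real arrays.

```python
import io
import zipfile

def npz_members(source):
    """Map array names to the raw NPY bytes of each member of an NPZ archive."""
    members = {}
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            key = name[:-4] if name.endswith(".npy") else name
            members[key] = archive.read(name)
    return members

# Demonstration with an in-memory archive; the payloads are placeholders,
# not real serialized arrays.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("arr_0.npy", b"\x93NUMPY placeholder 0")
    archive.writestr("arr_1.npy", b"\x93NUMPY placeholder 1")
members = npz_members(buffer)
```

`zipfile.ZipFile` accepts either a path or a file-like object, so the same function works on the downloaded file and on in-memory test archives.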

There is no reason this exercise has to be performed in Python.  Other programming languages are perfectly well able to read binary data, and the general steps involved will be very similar to those performed in this chapter in the Binary Serialized Data Structures section.  You could, for example, read the data within an NPY file into an R array instead.

### Scraping Web Traffic

The author's web domain, gnosis.cx, has been operating for more than two decades, and retains most of the "Web 0.5" technology and visual style it was first authored with.  One thing the web host provides, as do most others, is reports on traffic at the site (using nearly as ancient styling as that of the domain itself).  You can find the most current reports at:

> https://www.gnosis.cx/stats/

A snapshot of the reports current at the time of this writing is also copied to:

> https://www.gnosis.cx/cleaning/stats/

An image of the report page at the time of writing is below.

<img src="img/gnosis-traffic.png" alt="Traffic report for gnosis.cx" width="50%"/>

__Image: Traffic Report for gnosis.cx__

The weekly table shown is quite long since it goes back to February 2010.  The actual site is a decade older than that, but servers and logging databases were modified, losing older data.  There is also a rather large glitch of almost five years in the middle where traffic shows as zero.  The rather dramatic fall in traffic over the six weeks up to the snapshot reflects a change to using a CDN proxy for DNS and SSL (hence hiding traffic from the actual web host).

Your goal in this exercise is to write a tool to dynamically scrape the data made available in the various tables listing traffic sliced by different time increments and recurring periods (which day of week, which month of year, etc).  As part of this exercise, have your scripts generate less terrible graphs than the one shown in the screen picture (meaningless false perspective in a line graph offends good sensibility, and the apparent spike to negative traffic around the start of 2013 is simply inexplicable).

It is a common need to scrape a website similar to these reports.  The pattern of a regular and infrequently changed structure, but with contents updated on a daily basis, often reflects a data acquisition requirement.  A script like the one you will write in this exercise could run as a cron job or under a similar mechanism, to maintain local copies and revisions of such rolling reports.
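
A minimal sketch of the table-extraction step, using only the standard library, follows.  It assumes the report tables use explicit `<tr>`, `<td>`, and `<th>` tags with closing tags present, which may not hold for every page on the site; the class name, function name, and sample HTML are all hypothetical.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None

def scrape_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    return parser.rows

# Invented sample markup, not copied from the real reports
sample = (
    "<table><tr><th>Week</th><th>Hits</th></tr>"
    "<tr><td>2020-01-05</td><td> 1234 </td></tr></table>"
)
rows = scrape_rows(sample)
```

In a real script the HTML would come from something like `urllib.request.urlopen()`, and the rows would still need type conversion before graphing.  Third-party tools such as Beautiful Soup or `pandas.read_html()` are more forgiving of irregular markup than this sketch.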

## Anomaly Detection

The two exercises in this chapter ask you to look for anomalies first in quantitative data, then in categorical data.

### A Famous Experiment

The Michelson–Morley experiment was an attempt in the late 19th century to detect the existence of the *luminiferous aether*, a widely assumed medium that would carry light waves.  This was the most famous "failed experiment" in the history of physics in that it did not detect what it was looking for, something we now know not to exist at all.  The general idea was to measure the speed of light under different orientations of the equipment relative to the direction of movement of the Earth, since relative movement of the aether medium would add or subtract from the speed of the wave.  Yes, it does not work that way under the theory of relativity, but it was a reasonable guess 150 years ago.

Apart from the physics questions, the data set derived by the Michelson–Morley experiment is widely available, including as a sample built into R.  The same data is available at:

> https://www.gnosis.cx/cleaning/morley.dat

Figuring out the format, which is not complex, is a good first step of this exercise (and typical of real data science work).

The specific numbers in this data are measurements of the speed of light in km/s with a zero point of 299,000.  So, for example, the mean measurement in experiment 1 was 299,909 km/s.  Let us look at the data in the R bundle.

```r
library(dplyr)

data(morley)
morley %>%
    group_by(`Expt`) %>%
    summarize(Mean = mean(Speed), Count = max(Run))
```

```
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 5 x 3
   Expt  Mean Count
  <int> <dbl> <int>
1     1  909     20
2     2  856     20
3     3  845     20
4     4  820.    20
5     5  832.    20
```


In the summary, we just look at the number of runs of each experimental setup, and the mean across that setup.  The raw data has 20 measurements within each setup.

Using whatever programming language and tools you prefer, identify the outliers first within each setup (defined by an `Expt` number) and then within the data collection as a whole.  The hope in the original experiment was that each setup would show a significant difference in central tendency, and indeed their means are somewhat different.  This book and chapter do not explore confidence levels and null hypotheses in any detail, but you should create a visualization that helps you gain visual insight into how much apparent difference exists between the several setups.

If you discard the outliers within each setup, are the differences between setups increased or decreased? Answer with either a visualization or by looking at statistics on the reduced groups.
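
One common convention, though not the only reasonable one, is to call a value an outlier when it lies more than 1.5 interquartile ranges outside the quartiles of its group.  The sketch below applies that rule per setup using only the standard library; the function names are hypothetical, and `rows` is assumed to be an iterable of `(Expt, Speed)` pairs that you have already parsed from the data file.

```python
from collections import defaultdict
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR beyond the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

def outliers_by_group(rows):
    """Map each experiment number to the list of its outlying speeds."""
    groups = defaultdict(list)
    for expt, speed in rows:
        groups[expt].append(speed)
    return {expt: iqr_outliers(speeds) for expt, speeds in groups.items()}
```

Calling `iqr_outliers()` on the speeds of all setups pooled together gives the whole-collection comparison; a z-score threshold would be an equally defensible alternative to the IQR rule.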

### Misspelled Words

For this exercise we return to the 25,000 human measurements we have used to illustrate a number of concepts.  However, in this variation of the data set, each row has a person's first name (pulled from the US Social Security Administration list of common first names over the last century; apologies that the names lean Anglocentric because of the history of US population and immigration trends).

The data set for this exercise can be found at:

> https://www.gnosis.cx/cleaning/humans-names.csv

Unfortunately, our hypothetical data collectors for this data set are simply terrible typists, and they make typos when entering names with alarming frequency.  There are some number of intended names in this data set, but quite a few simple miscodings of those names as well.  The problem is: how do we tell a real name from a typo?

There are a number of ways to measure the similarity of strings, which can provide a clue as to likely typos.  One general class of approaches is in terms of *edit distance* between strings.  The R package **stringdist**, for example, provides Damerau-Levenshtein, Hamming, Levenshtein, and optimal string alignment as measures of edit distance.  Less edit-specific fuzzy matching techniques utilize a "bag of n-grams" approach, and include q-gram, cosine distance, and Jaccard distance.  Some heuristic metrics like Jaro and Jaro-Winkler are also included in `stringdist` along with the other measures mentioned.  Soundex, soundex variants, and metaphone look for similarity of the sounds of words as pronounced, but are therefore specific to language and even regional dialect.

In a reversal of the more common pattern of Python versus R libraries, Python is the one that scatters string similarity measures over numerous libraries, each including just a few measures.  However, **python-Levenshtein** is a very nice package including most of these measures.  If you want cosine similarity, you may have to use `sklearn.metrics.pairwise` or another module.  For phonetic comparisons, **fonetika** and **soundex** both support multiple languages (but different languages for each; English is in common for almost all packages).

On my personal system, I have a command-line utility called `similarity` that I use to measure how close strings are to each other.  This particular few-line script measures Levenshtein distance, but also normalizes it to the length of the longer string.  A short name will have a small numeric measure of distance, even between dissimilar strings, while long strings that are close overall can have a larger measure before normalization (the details depend on which measure is chosen, but this holds for most of them).  A few examples show this.

```bash
similarity David Davin
```

```
Levenshtein distance: 1
Similarity ratio: 0.8
```

```bash
similarity David Maven
```

```
Levenshtein distance: 3
Similarity ratio: 0.4
```

```bash
similarity "the quick brown fox jumped" \
           "thee quikc brown fax jumbed"
```

```
Levenshtein distance: 5
Similarity ratio: 0.814814814815
```
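
The `similarity` script itself is not shown, but a pure-Python stand-in is easy to sketch.  The normalization used here, one minus the distance divided by the length of the longer string, is an inference that happens to reproduce all three ratios above; it is not necessarily the exact formula the original utility uses.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Distance normalized by the longer string; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

for pair in [("David", "Davin"), ("David", "Maven")]:
    print(pair, levenshtein(*pair), similarity(*pair))
```

Plain Levenshtein distance counts a transposition such as "ck" to "kc" as two edits; Damerau-Levenshtein would count it as one, which may suit typing errors better.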


For this exercise, your goal is to identify every *genuine* name, and correct all the misspelled ones to the correct canonical spelling.  Keep in mind that sometimes multiple legitimate names are actually close to each other in terms of similarity measures.  However, it is probably reasonable to assume that *rare* spellings are typos, at least if they are also relatively similar to common spellings.  You may use whatever programming language, library, and metric you feel is the most useful for the task.

Reading in the data, we see it is similar to the human measures we have seen before.

```python
import pandas as pd

names = pd.read_csv('data/humans-names.csv')
names.head()
```

| Unnamed: 0 | Name | Height | Weight |
|---|---|---|---|
| 0 | James | 167.089607 | 64.806216 |
| 1 | David | 181.648633 | 78.281527 |
| 2 | Barbara | 176.2728 | 87.767722 |
| 3 | John | 173.270164 | 81.635672 |
| 4 | Michael | 172.181037 | 82.760794 |


It is easy to see that some "names" occur very frequently and others only rarely.  Look at the middling values as well in working on this exercise.

```python
names.Name.value_counts()
```

```
Elizabeth    1581
Barbara      1568
Jessica      1547
Jennifer     1534
             ... 
Josep           1
iWlliam         1
Joseeph         1
eJennifer       1
Name: Name, Length: 249, dtype: int64
```
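
The heuristic that rare spellings resembling common ones are probably typos can be sketched as follows.  Here `difflib.get_close_matches()` from the standard library stands in for an edit-distance measure (it uses a different similarity ratio), and both thresholds, along with the function name `propose_corrections()`, are arbitrary assumptions that you would need to tune against the real counts.

```python
from collections import Counter
from difflib import get_close_matches

def propose_corrections(names, rare_below=5, cutoff=0.75):
    """Map each rare spelling to the most similar common spelling, if any."""
    counts = Counter(names)
    common = [name for name, n in counts.items() if n >= rare_below]
    fixes = {}
    for name, n in counts.items():
        if n < rare_below:
            match = get_close_matches(name, common, n=1, cutoff=cutoff)
            if match:
                fixes[name] = match[0]
    return fixes

# Invented counts, loosely modeled on the output above
sample = ["William"] * 30 + ["David"] * 50 + ["iWlliam", "Davdi", "Zyx"]
fixes = propose_corrections(sample)
```

A rare spelling with no close common neighbor (like the invented "Zyx") is left alone, which is where the middling frequencies deserve a manual look.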

## Data Quality

For the exercises of this chapter, we first ask you to perform a typical multi-step data cleanup using techniques you have learned.  The second exercise asks you to try to characterize sample bias in a provided data set using analytic tools this book has addressed (or others of your choosing).

### Data Characterization

For this exercise, you will need to perform a fairly complete set of data cleaning steps.  The focus is on techniques discussed in this chapter, but concepts discussed in other chapters will be needed as well.  Some of these tasks will require skills discussed in later chapters, so skip ahead briefly, as needed, to complete the tasks.

Here we return to the "Brad's House" temperature data, but in its raw form.  The raw data consists of four files, corresponding to the four thermometers that were present.  These files may be found at:

> https://www.gnosis.cx/cleaning/outside.gz<br/>
> https://www.gnosis.cx/cleaning/basement.gz<br/>
> https://www.gnosis.cx/cleaning/livingroom.gz<br/>
> https://www.gnosis.cx/cleaning/lab.gz

The format of these data files is a simple but custom textual format.  You may want to refer back to chapter 1 (*Data Ingestion – Tabular Formats*) and to chapter 3 (*Data Ingestion – Repurposing Data Sources*) for inspiration on parsing the format.  Let us look at a few rows:

```bash
zcat data/glarp/lab.gz | head -5
```

```
2003 07 26 19 28 25.200000
2003 07 26 19 31 25.200000
2003 07 26 19 34 25.300000
2003 07 26 19 37 25.300000
2003 07 26 19 40 25.400000
```


As you can see, the space separated fields represent the components of a datetime, followed by a temperature reading.  The format itself is consistent for all the files.  However, the specific timestamps recorded in each file are not consistent.  All four data files end on 2004-07-16T15:28:00, and three of them begin on 2003-07-25T16:04:00. Various and different timestamps are missing in each file.  For comparison, we can recall that the full data frame we read with a utility function that performs some cleanup has 171,346 rows.  In contrast, the line counts of the several data files are:

```bash
for f in data/glarp/*.gz; do
    echo -n "$f: "
    zcat "$f" | wc -l
done
```

```
data/glarp/basement.gz: 169516
data/glarp/lab.gz: 168965
data/glarp/livingroom.gz: 169516
data/glarp/outside.gz: 169513
```


All of the tasks in this exercise are agnostic to the particular programming languages and libraries you decide to use.  The overall goal will be to characterize each of the 685k data points as one of several conceptual categories that we present below.

**Task 1**: Read all four data files into a common data frame.  Moreover, we would like each record to be identified by a proper native timestamp rather than by separated components.  You may wish to refer forward to chapter 7 (*Feature Engineering*) which discusses date/time fields.
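
For the timestamp part of Task 1, a line in the format shown earlier can be turned into a native datetime with a few lines of standard-library Python.  This is only a sketch of the parsing step; the function names are hypothetical, and combining the four files into one data frame is left to you.

```python
import gzip
from datetime import datetime

def parse_reading(line):
    """Split 'YYYY MM DD hh mm temperature' into (datetime, float)."""
    *parts, temperature = line.split()
    year, month, day, hour, minute = map(int, parts)
    return datetime(year, month, day, hour, minute), float(temperature)

def read_readings(path):
    """Yield (timestamp, temperature) pairs from one gzipped thermometer file."""
    with gzip.open(path, "rt") as lines:
        for line in lines:
            if line.strip():
                yield parse_reading(line)

first = parse_reading("2003 07 26 19 28 25.200000")
```

This sketch raises an exception on any malformed line; depending on what the raw files contain, you may prefer to catch and record such lines rather than stop.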

**Task 2**: Fill in all missing data points with markers indicating they are explicitly missing.  This will have two slightly different aspects.  There are some implied timestamps that do not exist in any of the data files.  Our goal is to have 3 minute increments over the entire duration of the data.  In the second aspect, some timestamps are represented in some data files but not in others.  You may wish to refer to the "Missing Data" section of this chapter and the same-named one in chapter 4 (*Anomaly Detection*); as well, the chapter 7 discussion of date/time fields is likely relevant.
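
If you are working in Pandas, one way to make the implied timestamps explicit is to reindex against a regular 3-minute grid.  The sketch below uses three of the lab readings shown earlier and deliberately leaves out the 19:34 reading, to show how the gap becomes an explicit missing marker.

```python
import pandas as pd

# Three readings from the lab file; 19:34 is omitted on purpose here
readings = pd.Series(
    [25.2, 25.2, 25.3],
    index=pd.to_datetime(
        ["2003-07-26 19:28", "2003-07-26 19:31", "2003-07-26 19:37"]
    ),
    name="lab",
)

# Regular grid spanning the data; timestamps absent from the data become NaN
grid = pd.date_range(readings.index.min(), readings.index.max(), freq="3min")
filled = readings.reindex(grid)
```

For the real data, the grid would run from the earliest start to the common end across all four files, and each instrument's series would be reindexed against that same grid.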

**Task 3**: Remove all regular trends and cycles from the data.  The relevant techniques may vary between the different instruments.  As we noted in the discussion in this chapter, three measurement series are of indoor temperatures regulated, at least in part, by thermostat, and one is of outdoors temperatures.  Whether or not the house in question had differences in thermostats or heating systems between rooms is left for readers to try to determine based on the data (at the very least though, heat circulation in any house is always imperfect and not uniform).

Note: As a step in performing detrending, it may be useful to temporarily impute missing data, as is discussed in chapter 6 (*Value Imputation*).

**Task 4**: Characterize every data point (timestamp and location) according to these categories:

* "Regular" data point that falls within generally expected bounds.
* "Interesting" data point that is likely to indicate relevant deviation from trends.
* "Data error" that reflects an improbable value relative to expectations, and is more likely to be a recording or transcription error.  Consider that a given value may be improbable based on its delta from nearby values and not exclusively because of absolute magnitude.  Chapter 4 is likely to be relevant here.
* Missing data point.

**Task 5**: Describe any patterns you find in the distribution of characterized data points.  Are there temporal trends or intervals that show most or all data characterized in a certain way? Does this vary by which of the four instruments we look at?

### Oversampled Polls

Polling companies often deliberately utilize oversampling (overselection) in their data collection.  This is a somewhat different issue than the overweighting discussed in a topic of this chapter, or than the mechanical oversampling addressed in chapter 6 (*Value Imputation*).  Rather, the idea here is that a particular class, or a value range, is known to be uncommon in the underlying population, and hence the overall parameter space is likely to be sparsely filled for that segment of the population.  Alternately, the oversampled class may be common in the population but also represents a subpopulation about which the analytic purpose needs particularly high discernment.

Use of oversampling in data collection itself is not limited to human subjects surveyed by polling companies.  There are times when it similarly makes sense for entirely unrelated subject domains; e.g. the uncommon particles produced in cyclotrons or the uncommon plants in a studied forest.  Responsible data collectors, such as the Pew Research Center that collected the data used in this exercise, will always explicitly document their oversampling methodology and expectations about the distribution of the underlying population.  You can, in fact, read all of those details about the 2010 opinion survey we utilize at:

> https://www.pewsocialtrends.org/2010/02/24/millennials-confident-connected-open-to-change/

However, to complete this exercise, we prefer that you initially skip consulting that documentation.  For the work here, pretend that you received this data without adequate accompanying documentation and metadata (just to be clear: Pew is meticulous here).  Such is all too often the case in the real world of messy data.  The raw data, with no systematic alteration to introduce bias or oversampling, is available by itself at:

> https://www.gnosis.cx/cleaning/pew-survey.csv

**Task 1**: Read in the data, and make a judgement about what ages were deliberately over- or undersampled, and to what degree.  We may utilize this weighting in later synthetic sampling or weighting, but for now simply add a new column called `sampling_multiplier` to each observation of the data set matching your belief.  

For this purpose, treat 1x as the "neutral" term.  So, for example, if you believe 40 year old subjects were overselected by 5x, assign the multiplier 5.0.  Symmetrically, if you believe 50 year olds were systematically underselected by 2x, assign the multiplier 0.5.  Keep in mind that humans in the United States in 2010 were not uniformly distributed by age.  Moreover, with a sample size of about 2000 and 75 different possible ages, we expect some non-uniformity of subgroup sizes simply from randomness.  Merely random variation from the neutral selection rate should still be coded as 1.0.
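
The arithmetic behind the multiplier can be sketched as the ratio of each age's share of the sample to its share of the underlying population, with ratios near 1 snapped back to neutral.  In the sketch below, `population_share` is a hypothetical mapping you would build from an outside source, the function names are invented, and the tolerance is an arbitrary assumption rather than a statistically derived threshold.

```python
from collections import Counter
from math import log

def sampling_ratios(sample_ages, population_share):
    """Share of the sample at each age divided by its population share."""
    counts = Counter(sample_ages)
    total = sum(counts.values())
    return {
        age: (n / total) / population_share[age]
        for age, n in counts.items()
    }

def to_multiplier(ratio, tolerance=0.25):
    """Treat ratios within a factor of (1 + tolerance) of neutral as 1.0."""
    return 1.0 if abs(log(ratio)) <= log(1 + tolerance) else ratio

# Invented numbers: age 20 is 50% of the sample but 10% of the population
ratios = sampling_ratios([20] * 50 + [40] * 50, {20: 0.1, 40: 0.5})
```

Comparing on a log scale keeps over- and underselection symmetric, so that 2x overselection and 2x underselection (0.5) are judged equally far from neutral.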

**Task 2**: Some of the categorical fields seem to encode related but distinct binary values.  For example, this question about technology is probably not ideally coded for data science goals:

```python
pew = pd.read_csv('data/pew-survey.csv')
list(pew.q23a.unique())
```

```
['New technology makes people closer to their friends and family',
 'New technology makes people more isolated',
 '(VOL) Both equally',
 "(VOL) Don't know/Refused",
 '(VOL) Neither equally']
```

Since the first two descriptions may both be believed, or neither believed, by a given surveyed person, encoding each as a separate boolean value makes sense.  How to handle a refusal to answer is an additional decision for you to make in this reencoding.  Determine which categorical values should better be encoded as multiple booleans, and modify the data set accordingly.  Explain and justify your decisions about each field.
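
For the `q23a` field specifically, one possible reencoding is sketched below.  The new column names `tech_closer` and `tech_isolated` are hypothetical, and mapping a refusal to a missing value in both columns is just one of the choices you might justify.

```python
import pandas as pd

CLOSER = "New technology makes people closer to their friends and family"
ISOLATED = "New technology makes people more isolated"
BOTH = "(VOL) Both equally"
NEITHER = "(VOL) Neither equally"

# (tech_closer, tech_isolated) for each answer; anything else, including
# "(VOL) Don't know/Refused", becomes missing in both columns
ENCODING = {
    CLOSER: (True, False),
    ISOLATED: (False, True),
    BOTH: (True, True),
    NEITHER: (False, False),
}

def encode_q23a(answers):
    """Turn a Series of q23a answers into two nullable boolean columns."""
    pairs = [ENCODING.get(answer, (None, None)) for answer in answers]
    columns = ["tech_closer", "tech_isolated"]
    frame = pd.DataFrame(pairs, columns=columns, index=answers.index)
    return frame.astype("boolean")

demo = encode_q23a(pd.Series([CLOSER, BOTH, "(VOL) Don't know/Refused"]))
```

The result can be joined back onto the survey data frame by index, after which the original column can be dropped or kept for auditing.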

**Task 3**: Determine whether any demographic fields other than age were oversampled.  While the names of columns are largely cryptic, you can probably safely assume that fields with qualitative answers indicating degree of an opinion are dependent variables surveyed rather than demographic independent variables.  For example:

```python
list(pew.q1.unique())
```

```
['Very happy', 'Pretty happy', 'Not too happy', "(VOL) Don't know/Refused"]
```

You may need to consult outside data sources to make judgements for this task.  For example, you should be able to find the rough population distribution of U.S. timezones (in 2010) to compare to the data set distribution.

```python
list(pew.timezone.unique())
```

```
['Eastern', 'Central', 'Mountain', 'Pacific']
```

**Task 4**: Some fields, such as `q1` presented in Task 3, are clearly ordinally encoded.  While it is not directly possible to assign relative ratios for (Very happy:Pretty happy) versus (Pretty happy:Not too happy), the ranking of those three values is evident, and calling them ordinal 1, 2, and 3 is reasonable and helpful.  You will, of course, also have to encode refusal to answer in some fashion.  Re-encode all relevant fields to take advantage of this intuitive domain knowledge you have.
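
As an illustration for `q1`, an ordered categorical in Pandas gives the ranking directly.  In this sketch the direction of the scale (1 for the least happy answer) and the use of a missing value for refusals are both assumptions; the helper name `encode_ordinal()` is hypothetical.

```python
import pandas as pd

HAPPINESS = ["Not too happy", "Pretty happy", "Very happy"]  # low to high

def encode_ordinal(series, ordered_levels):
    """Rank answers 1..n by `ordered_levels`; anything else becomes <NA>."""
    cat = pd.Categorical(series, categories=ordered_levels, ordered=True)
    codes = pd.Series(cat.codes, index=series.index)  # -1 marks unlisted values
    return codes.where(codes >= 0).add(1).astype("Int64")

answers = pd.Series(
    ["Very happy", "Not too happy", "(VOL) Don't know/Refused", "Pretty happy"]
)
ranks = encode_ordinal(answers, HAPPINESS)
```

The same helper applies to any other ordinal field once you have written down its levels in order.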