# DAND - Data Wrangling Project

## Wrangling Report
Tom Schonig, February 26th 2019

This project required wrangling data about the WeRateDogs twitter account, covering the full data wrangling lifecycle:
    - Gathering; obtaining our data from (3) sources, in (3) formats.
    - Assessing; reviewing our data for quality and tidiness issues, and documenting our observations.
    - Cleaning; performing tasks to remediate the identified quality and tidiness issues.

Prior to starting the data wrangling process, the project documentation was carefully reviewed to understand both the project's objectives and any constraints that needed to be explicitly considered (in addition to use of 'best practices' and meeting rubric criteria).

### Requirements
The below requirements were documented in the "Project Motivation" and "Project Details" pages:
 - Use only original rating tweets that have images; do not use retweets and non-rating tweets
 - At least (8) quality issues and (2) tidiness issues must be documented and remediated
 - At least (3) insights and (1) visualization must be produced
 - Written reports must be prepared:
     - 300-600 words describing the wrangling efforts (named 'wrangle_report'; submitted in PDF or HTML)
     - 250-word minimum communicating insights and analyses (named 'act_report'; submitted in PDF or HTML)
 - Store the clean DataFrame in a CSV named 'twitter_archive_master.csv', as well as other tables required for tidiness
 - The "WeRateDogs" twitter archive must be downloaded manually and read into the .ipynb
 - The tweet image predictions must be requested programmatically using the provided URL
 - API tokens or credentials must not be included in final submission
 - Wrangling must capture each rating's:
     - Count of retweets
     - Count of favorite/"like" interactions
     - Tweet ID

Udacity requires the use of the (3) data sources below:
 - A Twitter archive of the "WeRateDogs" account; provided in CSV format
 - Additional data from the Twitter API; using Tweepy
 - Image prediction data from a Udacity neural net; hosted in TSV format

The below were explicitly listed as non-requirements:
   - Full sanitization of all data
   - Rating ratios > 1 are valid and do not need to be cleaned
   - Tweets do not need to be gathered beyond August 1st 2017

## Methodology
 
 ### Data Gathering
     This phase was straightforward: we followed the explicit instructions on how to access each data source.
     
         - 'twitter_achive.csv' was manually uploaded to the project workspace per Udacity instruction, and read 
         into a Pandas DataFrame named 'twitter_archive'
         
         - The Twitter API was called using the Tweepy library, using the 'twitter_archive' tweet_ids as a parameter. 
         The tweet JSON data was written to a text file named 'tweet_json', which was then read back and assembled 
         into a new DataFrame called 'twitter_api_data'. 
             - Because the API calls took a long time, wrapping each call in "try" and "except" blocks was essential, 
             as any 'twitter_archive' ID without a matching API record caused the cell to error. This was learned the hard way.
             
         - The Udacity image prediction data was programmatically requested, read, and written locally. As a 
         useful exercise, some extra logic was added so that the file is only fetched when no local copy exists.
         The local copy is then read into a DataFrame.
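
The two gathering safeguards described above (skipping the download when a local copy exists, and guarding each API call with try/except) could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the helper names `fetch_if_missing`, `download_with_requests`, and `collect_tweets` are hypothetical, and `get_status` stands in for whatever Tweepy call was actually used.

```python
import json
import os


def download_with_requests(url):
    """Default downloader; assumes the third-party `requests` package is installed."""
    import requests  # imported lazily so the other helpers work without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def fetch_if_missing(url, local_path, downloader=download_with_requests):
    """Write `url` to `local_path` unless a local copy already exists.

    Returns True when a download happened, False when the cached file was reused.
    """
    if os.path.isfile(local_path):
        return False
    with open(local_path, "wb") as handle:
        handle.write(downloader(url))
    return True


def collect_tweets(tweet_ids, get_status, out_path):
    """Write one JSON object per line to `out_path`; return the IDs that failed.

    `get_status` is a hypothetical callable mapping a tweet ID to a dict. With
    Tweepy it might wrap `api.get_status(tweet_id)._json` (an assumption).
    """
    failed = []
    with open(out_path, "w") as out:
        for tweet_id in tweet_ids:
            try:
                status = get_status(tweet_id)
            except Exception as err:  # ideally narrowed to the API's error class
                failed.append((tweet_id, str(err)))
                continue  # a missing tweet no longer aborts the whole run
            out.write(json.dumps(status) + "\n")
    return failed
```

Returning the failed IDs, rather than only printing them, makes it easy to quantify the record breaks later during assessment.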

 ### Data Assessment
     The data was manually and programmatically reviewed for quality and tidiness. The below questions were framed as 
     guidance for inquiry.

   #### Quality
        - Completeness : Are there missing records or values within and between tables?
        - Accuracy : Is there wrong data that conforms to each column's validation rules?
        - Consistency : Is the same information represented consistently across sources?
        - Validity : Does any data break validation rules required by our schema? Do they defy real-world constraints?
        
   #### Tidiness
        - Does each column represent a distinct variable?
        - Does each observation occupy its own row?
        - Does each type of observation have its own table?

    To answer these questions, I began by looking at the column structure of each dataframe, the record counts, and the 
    proportion of NaN values. Thankfully, there was already a common key between the sources and no overlapping columns. 
    Likewise, though the dataframes had different record counts, the differences were not major (i.e. not orders
    of magnitude apart). I also checked for duplicates, to see whether each record truly represented a unique observation.
    
    I recorded some general observations around completeness, tidiness, and consistency, and then began looking at 
    individual columns. The archive data column names were fairly descriptive. I leaned on the value_counts() method to 
    get a cursory understanding of how complex each column was, i.e. whether it naturally lent itself to categorization or 
    validation rules, or whether it was tracking less tractable information. I tried to consider the real-world context 
    for each column, in order to highlight major deviations or to inspire validation rules that could be used to
    programmatically check each column (e.g. proper nouns in the 'name' column should have at least one uppercase character).
    
    Programmatic checks were used to guide manual reviews, which were ultimately necessary to identify some corrections 
    in the data. I would later loop back into assessment after cleaning the structural issues with the data, to drill deeper
    into accuracy, consistency, and validity issues.
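
As an illustration of the programmatic checks described above, the sketch below runs them on a tiny made-up frame. The sample rows are invented for the example; only the column names 'tweet_id', 'name', and 'expanded_urls' come from the archive.

```python
import pandas as pd

# Tiny stand-in for 'twitter_archive'; the values are made up for illustration.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "name": ["Bella", "a", "None", "quite"],
    "expanded_urls": ["u1", "u1", None, "u2"],
})

# Completeness: proportion of NaN values per column.
nan_share = archive.isna().mean()
print(nan_share)

# Uniqueness: does each record represent a distinct observation?
duplicate_ids = archive["tweet_id"].duplicated().sum()
duplicate_urls = archive["expanded_urls"].dropna().duplicated().sum()
print(duplicate_ids, duplicate_urls)

# Cursory look at how varied a column is.
print(archive["name"].value_counts())

# Validation rule: proper nouns should contain at least one uppercase character.
lowercase_names = archive.loc[archive["name"].str.islower(), "name"]
print(lowercase_names.tolist())
```

Checks like these only flag candidates; the flagged rows still need a manual look before deciding whether they are really errors.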

##### Observations

Here are the consolidated notes from the assessment phase.

Quality -

     - The record counts do not match between data sources

    'twitter_archive'
        - "source" column is mistyped; should be category. Needs to be parsed to be intelligible
        - "doggo", "floofer", "pupper", "puppo" columns are strings; some dogs occupy multiple 'stages'
        - Numerators and denominators appear to contain both inconsistent and inaccurate data
        - There appear to be (137) duplicated "expanded_urls" values
        - The timestamp columns are strings
        - Contains retweets and replies, i.e. non-rating tweets
        - Some columns contain NaNs: 'in_reply_to_status_id','in_reply_to_user_id','retweeted_status_id',
            'retweeted_status_user_id','retweeted_status_timestamp','expanded_urls'
                * The NaNs in the 'in_reply...' and 'retweet...' columns will be helpful for identifying non-ratings tweets
                * The explicit requirement is for original rating tweets with images - so missing urls may be dropped as 
                   well
        - There are (109) values in the 'name' column that are all lowercase, which appear to be dirty data (as proper 
        nouns, these should have at least one capital)
        - tweet_id 832645525019123713 did not contain an actual dog rating

    
    'twitter_api_data'
        - columns imported as strings instead of integers
    
    'image_predictions'
        - Column headings are not intelligible
        - There appear to be duplicated "jpg_url" values

Tidiness -

    - image_predictions - the "p1", "p2", etc. headings repeat because they encode another variable in the column 
    names: the iteration of the prediction for each dog
    - 'twitter_api_data' should be consolidated with the 'twitter_archive' data


 ### Data Cleaning
     Once the major quality and tidiness issues were identified and documented, the wrangling exercise moved toward cleaning.
     This required defining exactly how the data should be represented in our desired dataset, followed by coding and 
     testing solutions to confirm that we have remediated the issues.
     
     The cleaning process started with addressing completeness and tidiness issues, so as to facilitate further cleaning 
     efforts and to avoid rework or making the same amendments across multiple sources. Because our project 
     requirements documented a complete dataset as one containing only original ratings tweets with images, the 
     most appropriate place to begin cleansing was the 'extra' records in the archive data. Some parameters to flag these 
     were identified in the assessment phase:
         - Tweets with data populated indicating that it was a retweet or reply (the presence of non-NaN values)
         - Tweets without URLs, which indicate that the tweet had no image (the presence of NaNs)
         - Tweets with duplicated URLs, suggesting it may not be an original ratings tweet
         
     These records were programmatically selected, quantified, reviewed, and dropped. The dataframe was re-assessed to 
     confirm that the NaNs in all columns were now in keeping with the above expectations. 
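
A minimal sketch of the flag, quantify, and drop steps described above, assuming the archive's reply and retweet ID columns are NaN for original tweets. The sample rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: one original tweet, one reply, one retweet, one tweet without a URL.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "in_reply_to_status_id": [np.nan, 10.0, np.nan, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 20.0, np.nan],
    "expanded_urls": ["u1", "u2", "u3", None],
})

# Non-NaN values in these columns mark replies and retweets.
is_reply = archive["in_reply_to_status_id"].notna()
is_retweet = archive["retweeted_status_id"].notna()
# A missing URL suggests the tweet has no image.
no_url = archive["expanded_urls"].isna()

to_drop = is_reply | is_retweet | no_url
print(f"{to_drop.sum()} of {len(archive)} records flagged for removal")

originals = archive.loc[~to_drop].copy()

# Re-assess: the reply and retweet columns should now be entirely NaN.
print(originals[["in_reply_to_status_id", "retweeted_status_id"]].isna().all())
```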
     
     Then the API data was joined to the archive for tidiness, since the dataframe contained data that extended the 
     same observation unit / record type of the archive. Once joined, the consolidated dataframe was again checked for NaNs,
     as it was expected that there would not be full matching of records. New NaNs were discovered, relating to records 
     that did not fully overlap between data sources. These were again assessed, and eventually dropped for completeness.
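
One way to sketch this join is a left merge on 'tweet_id' with pandas' `indicator` option, which labels the rows that found no match. The frames below are invented, and the use of `indicator` is an assumption for illustration rather than a record of the notebook's approach.

```python
import pandas as pd

# Invented stand-ins for the cleaned archive and the API data.
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "text": ["a", "b", "c"]})
api_data = pd.DataFrame({
    "tweet_id": [1, 3, 4],
    "retweet_count": [5, 7, 9],
    "favorite_count": [10, 20, 30],
})

# Left join keeps every archive record; '_merge' records whether a match was found.
merged = archive.merge(api_data, on="tweet_id", how="left", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(merged)} archive records have no API data")

# Drop the unmatched records, then restore integer counts (NaNs had forced floats).
master = merged[merged["_merge"] == "both"].drop(columns="_merge")
master = master.astype({"retweet_count": int, "favorite_count": int})
print(master)
```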
     
     Addressing the last identified tidiness issue, the image prediction dataframe was restructured using the melt() and
     concat() functions. This was necessary because the table had redundant columns for each iteration of predictions, 
     which were not distinct variables requiring their own columns. Likewise, the table held one row per tweet, although 
     its purpose appeared to be tracking individual predictions. After melting and reassembling the dataframe, checks were 
     performed to ensure that the record count of the new table matched expectation (three times the original) and the 
     table's head was printed for visual examination. 
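
The restructuring could look roughly like the sketch below: each family of columns (prediction, confidence, dog flag) is melted separately, and the melts are then combined side by side. The helper `melt_family`, the output column names, and the sample values are hypothetical; the "p1", "p1_conf", "p1_dog" naming pattern is assumed from the image prediction file.

```python
import pandas as pd

# Invented rows following the assumed p<N>, p<N>_conf, p<N>_dog column pattern.
preds = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1": ["golden_retriever", "paper_towel"],
    "p1_conf": [0.90, 0.50],
    "p1_dog": [True, False],
    "p2": ["pug", "beagle"],
    "p2_conf": [0.05, 0.30],
    "p2_dog": [True, True],
    "p3": ["bagel", "chow"],
    "p3_conf": [0.01, 0.10],
    "p3_dog": [False, True],
})


def melt_family(df, cols, value_name):
    """Melt one family of columns into long form, keyed by tweet and prediction rank."""
    long = df.melt(id_vars=["tweet_id"], value_vars=cols,
                   var_name="prediction_rank", value_name=value_name)
    # 'p2_conf' -> 2: the rank moves out of the heading and into a column value.
    long["prediction_rank"] = (
        long["prediction_rank"].str.extract(r"p(\d)", expand=False).astype(int)
    )
    return long.set_index(["tweet_id", "prediction_rank"])


tidy = (
    pd.concat(
        [
            melt_family(preds, ["p1", "p2", "p3"], "prediction"),
            melt_family(preds, ["p1_conf", "p2_conf", "p3_conf"], "confidence"),
            melt_family(preds, ["p1_dog", "p2_dog", "p3_dog"], "is_dog"),
        ],
        axis=1,
    )
    .reset_index()
    .sort_values(["tweet_id", "prediction_rank"])
    .reset_index(drop=True)
)

# Sanity check: three prediction rows per original row.
assert len(tidy) == 3 * len(preds)
print(tidy.head())
```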
     
     To set expectations for subsequent analyses, the prediction data was joined to the master archive data and
     checked for NaNs / record breaks. If this dataset did not overlap with our archive, it may not be relevant for use in
     analysis. We quantified the breaks at under 6% of the new archive data, which gave comfort that the records could be 
     used later (with some understood limitations). 
     
     Cleaning then proceeded to accuracy, validity and consistency issues. Data types were converted, and some fields 
     were parsed for intelligibility in subsequent analysis. Where project documentation had suggested the 'doggo', etc. 
     fields could be treated as a 'dog stage' column, I did not agree with this approach and saw an opportunity to pass 
     these columns into linear regression models by converting them to binary integers. Accordingly, these columns were 
     looped over to make the changes, printing the value counts before and after to confirm the changes.
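
The conversions described above might be sketched as follows. The sample values are assumptions made for illustration: the 'source' cells are taken to be HTML anchor tags, and each stage column is taken to hold either the stage name or the placeholder string "None".

```python
import pandas as pd

# Invented rows; the cell formats are assumptions about the archive's raw values.
archive = pd.DataFrame({
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    ],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
    "doggo": ["doggo", "None"],
    "floofer": ["None", "None"],
    "pupper": ["pupper", "None"],
    "puppo": ["None", "puppo"],
})

# Keep only the link text of the anchor tag, then store it as a category.
archive["source"] = (
    archive["source"].str.extract(r">([^<]+)<", expand=False).astype("category")
)

# Parse the timestamp strings into timezone-aware datetimes.
archive["timestamp"] = pd.to_datetime(archive["timestamp"], utc=True)

# Binary-encode each stage column: 1 if the stage applies, 0 otherwise.
for col in ["doggo", "floofer", "pupper", "puppo"]:
    print(archive[col].value_counts())       # before
    archive[col] = (archive[col] == col).astype(int)
    print(archive[col].value_counts())       # after

print(archive.dtypes)
```

Keeping four separate 0/1 columns also preserves the dogs that occupy more than one stage, which a single 'dog stage' column would have to collapse.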
     
     I then iterated back into assessing the data, using extreme values in the ratings field to check for quality 
     issues, and expecting a priori two major forces at play:
        1) Transcription issues from programmatically extracting the ratings
        2) Legitimately outlandish scores, which would make for amusing manual review
    
    Both expectations were met by the assessment. Where errors were observed, they were noted for future clean-up. Full 
    notes from assessment of all the data sources are below. 

### Cleaning Summary
Two tidiness issues have been addressed:
    1) Separation of the 'twitter_archive' and 'twitter_api_data' has been remediated by merging the two dataframes
    2) Embedded variability in the columns of 'image_predictions' has been remediated by melting the dataframe and 
    combining the melts, so that each prediction iteration is denoted by a column value instead of separate headings

Additionally, ten quality issues have been addressed:
    - Only original rating tweets remain, arrived at by:
        1) Dropping retweets and replies
        2) Dropping tweets without images
        3) Dropping (1) tweet containing a duplicated image URL
        4) Dropping (1) tweet containing a GoFundMe solicitation, and no rating
    - Converting datatypes where required for utility and consistency
        5) Parsing out the HTML from the 'source' column and converting it to a categorical datatype
        6) Replacing the strings in the 'doggo', 'floofer', 'puppo', and 'pupper' columns with integers and 
            converting the datatype
        7) Converting the 'timestamp' and 'retweet_timestamp' columns to datetime datatypes
        8) 'twitter_api_data' columns read as strings have been converted to integers
    - Column headings are intelligible
        9) Image prediction columns have been renamed for self-descriptiveness
    - Cleansed dirty values from the 'rating_numerator' and 'rating_denominator' columns by reviewing both extreme 
        and mis-proportioned values, and manually identifying errors in the programmatic extraction
        10) Applied corrected numerator and denominator values from each tweet using a dictionary of tweet_ids and values
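
Item 10 could be sketched as below. The tweet IDs and rating values are hypothetical placeholders, not the corrections actually applied in the notebook.

```python
import pandas as pd

# Invented master rows with two obviously wrong extractions.
master = pd.DataFrame({
    "tweet_id": [101, 102, 103],
    "rating_numerator": [960, 1, 12],
    "rating_denominator": [0, 2, 10],
})

# Hypothetical corrections found by manual review: tweet_id -> (numerator, denominator).
corrections = {
    101: (13, 10),
    102: (9, 10),
}

rating_cols = ["rating_numerator", "rating_denominator"]
for tweet_id, (numerator, denominator) in corrections.items():
    mask = master["tweet_id"] == tweet_id
    master.loc[mask, rating_cols] = [numerator, denominator]

# Test: the corrected rows now hold the reviewed values.
print(master[master["tweet_id"].isin(list(corrections))])
```

Keying the corrections by tweet ID keeps the manual decisions in one reviewable place and makes the step safe to re-run.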
 
From the original assessment observations, the below remain unaddressed:

    1) Completeness - The record counts do not match between data sources. However, we have sanitized (2) of the (3) sources
    to match completely, and have quantified the number of breaks with the image prediction data [it is acceptable]. The 
    new 'twitter_archive_master' dataframe is internally coherent, and we've kept the 'image_prediction_cleaned' data
    separate for tidiness, as it represents a different type of observation (public tweet vs. prediction event).

    2) There are known data quality issues in the 'name' column. However, we will not address these, as perfect data 
    quality here is not required by the project, and the utility of sanitizing this column is likely not worth the effort. 
    Effort here does not seem fruitful toward improving completeness or tidiness, and anonymization of the dogs does not 
    materially impact our analyses. In fact, while there is no explicit requirement to do so for dogs, certain types of 
    analyses may require the obfuscation or anonymization of any personally identifiable information.
        

## Resources Used

 - Udacity DAND classroom materials
 - Stack Overflow : https://stackoverflow.com/
 - Stack Abuse : https://stackabuse.com/
 - Pandas documentation : https://pandas.pydata.org/pandas-docs/stable/index.html
 - Tweepy documentation : http://docs.tweepy.org/en/v3.5.0/
 - JSON documentation : https://docs.python.org/3/library/json.html
 - cran.r-project.org, 'Tidy Data' : https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
 - Requests documentation : http://docs.python-requests.org/en/master/user/quickstart/
 - docs.python.org, os.path documentation : https://docs.python.org/3/library/os.path.html