# Data Wrangling - WeRateDogs

This report describes the data wrangling steps taken to prepare the WeRateDogs data for analysis. The work was divided into three parts:
* Gathering the data
* Assessing the data
* Cleaning the data

The report is organized around these three stages.

### Gathering Data

The data used for this analysis came from three different sources.
1. The main dataset is the **WeRateDogs Twitter archive**, which was downloaded manually as a `.csv` file from the Udacity learning platform and read into a pandas dataframe using the `pd.read_csv` function.
2. To get the dog breed for each tweet, the **image prediction data** was downloaded programmatically as a `.tsv` file from its hosted URL using the `requests` library. This data was also read into a pandas dataframe with `pd.read_csv`.
3. To complement the main dataset, additional information about each tweet (retweet and favorite counts) was retrieved from the Twitter API using the `tweepy` library. The JSON data was written to a text file named `tweet_json.txt`, one tweet per line, which was then parsed into a list of dictionaries. This list was converted into a dataframe using the `pd.DataFrame` constructor.

A snapshot of the code used is shown below:

```python
import json
from timeit import default_timer as timer

import pandas as pd
import requests

# load WeRateDogs archive data from csv file
twitter_df = pd.read_csv("twitter-archive-enhanced.csv")

# download image prediction data from the url provided in the course
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
with open(url.split('/')[-1], mode='wb') as f:
    f.write(response.content)

# read the file into a pandas dataframe
image_df = pd.read_csv("image-predictions.tsv", sep='\t')

# NOTE: `api` is assumed to be an authenticated tweepy.API object
# created earlier in the notebook (credentials are not shown here)
file_name = 'tweet_json.txt'

# iterate through the twitter dataset to get each tweet id
for tweet_id in twitter_df['tweet_id']:
    try:
        start = timer()

        # get additional data from twitter api
        tweet = api.get_status(tweet_id, tweet_mode='extended')

        # append data to tweet_json.txt, one json object per line
        # (mode 'a' creates the file if it does not exist yet)
        with open(file_name, mode='a') as f:
            f.write(json.dumps(tweet._json) + '\n')
        print(tweet_id)
        end = timer()
        print(end - start)
    except Exception as e:
        # report the tweet that could not be retrieved and continue
        print(tweet_id, e)

tweets = []

# read data from tweet_json.txt file to a list of dictionaries
with open(file_name, 'r') as file:
    for line in file:
        # convert json strings to dictionaries
        data = json.loads(line)
        # keep only the fields needed for the analysis
        tweets.append({k: v for (k, v) in data.items() if k in ('id', 'retweet_count', 'favorite_count')})
tweepy_df = pd.DataFrame(tweets)
```


### Assessing Data

In this stage, the three dataframes were assessed visually and programmatically for quality and tidiness issues. Eight quality issues and three tidiness issues were identified, as listed below:
##### Quality Issues

1. The records with non-null values in the `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp` columns are retweets, not original tweets.
2. The records with non-null values in the `in_reply_to_status_id` and `in_reply_to_user_id` columns are replies, not original tweets.
3. The `source` column contains a full HTML anchor tag instead of just the tweet source.
4. The `name` column contains incorrect names for some of the dogs. An example is the record at index `2352`.
5. Missing values in some other columns are stored as the string `'None'` instead of `NaN`, which makes them hard to detect and handle as nulls during analysis.
6. The `timestamp` and `retweeted_status_timestamp` columns are stored as `object` rather than `datetime64[ns]`.
7. Only about 10% of the records have non-null values in the `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp` columns.
8. In `image_df`, the predictions are inconsistently cased: some are lower case while others are title case.
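
The programmatic checks behind several of these issues can be sketched with a few pandas calls. This is only an illustration: the rows below are invented stand-ins for `twitter_df` and `image_df`, and the `p1` column name for the top prediction is an assumption about the image prediction file rather than something stated in this report.

```python
import numpy as np
import pandas as pd

# Invented rows standing in for twitter_df (illustration only).
twitter_df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000",
                  "2017-07-30 15:58:51 +0000", "2017-07-29 16:00:24 +0000"],
    "retweeted_status_id": [np.nan, 10.0, np.nan, np.nan],
    "in_reply_to_status_id": [np.nan, np.nan, 20.0, np.nan],
    "name": ["Phineas", "None", "a", "Tilly"],
})

# Invented rows standing in for image_df; `p1` is an assumed column name.
image_df = pd.DataFrame({"tweet_id": [1, 4],
                         "p1": ["golden_retriever", "Chihuahua"]})

# Issues 1 and 2: non-null ids mark retweets and replies.
n_retweets = twitter_df["retweeted_status_id"].notna().sum()
n_replies = twitter_df["in_reply_to_status_id"].notna().sum()

# Issue 5: missing values stored as the string 'None'.
n_none_names = (twitter_df["name"] == "None").sum()

# Issue 6: the timestamp column has not been parsed as a datetime.
timestamp_is_datetime = pd.api.types.is_datetime64_any_dtype(twitter_df["timestamp"])

# Issue 8: count predictions that are entirely lower case versus the rest.
n_lowercase = image_df["p1"].str.islower().sum()
n_other_case = len(image_df) - n_lowercase

print(n_retweets, n_replies, n_none_names, timestamp_is_datetime)
print(n_lowercase, n_other_case)
```

On the real dataframes, the same expressions give the counts that motivate each issue; `twitter_df.info()` and `twitter_df.sample()` are useful companions for a visual check.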

##### Tidiness Issues
1. `twitter_df` and `tweepy_df` have more rows than `image_df`. This means some tweets have no image predictions and may have to be dropped.
2. The `retweet_count` and `favorite_count` columns from `tweepy_df` should be in `twitter_df` instead.
3. `twitter_df` contains both information about the tweets and the type of dog. The type of dog should be contained in a separate dataset.
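
One possible way to address the first two tidiness issues is a pair of merges on the tweet id. The sketch below uses invented rows, and the `text` and `p1` columns are assumptions made for illustration; only `id`, `retweet_count` and `favorite_count` are known from the gathering code above. It is not necessarily the approach taken in `wrangle_act.ipynb`.

```python
import pandas as pd

# Invented rows standing in for the three dataframes (illustration only).
twitter_df = pd.DataFrame({"tweet_id": [1, 2, 3],
                           "text": ["first", "second", "third"]})
tweepy_df = pd.DataFrame({"id": [1, 2, 3],
                          "retweet_count": [5, 7, 9],
                          "favorite_count": [50, 70, 90]})
image_df = pd.DataFrame({"tweet_id": [1, 3],
                         "p1": ["golden_retriever", "chihuahua"]})

# Issue 2: bring the counts into the tweet table.
# tweepy_df uses `id` as its key, so rename it to match `tweet_id` first.
counts = tweepy_df.rename(columns={"id": "tweet_id"})
merged_df = twitter_df.merge(counts, on="tweet_id", how="left")

# Issue 1: an inner join keeps only tweets that have an image prediction.
master_df = merged_df.merge(image_df, on="tweet_id", how="inner")

print(master_df)
```

The left join keeps every archived tweet even if the API returned nothing for it, while the inner join drops tweets without predictions. Whether dropping them is appropriate depends on the analysis.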

### Cleaning Data

For each of the issues identified in the previous stage, the **define, code and test** technique was used: the fix was first defined in words, then implemented in code, and finally tested to confirm that the issue was resolved. Please check the `wrangle_act.ipynb` notebook for details on how each issue was cleaned.
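
As a rough illustration of the define, code and test pattern, the sketch below applies it to three of the quality issues. The rows are invented, the regular expression is one assumed way of stripping the anchor tag, and none of this is copied from `wrangle_act.ipynb`.

```python
import numpy as np
import pandas as pd

# Invented rows standing in for twitter_df (illustration only).
twitter_df = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
    "source": ['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
               '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'],
    "name": ["Phineas", "None"],
})

# Always clean a copy so the raw data stays available for comparison.
clean_df = twitter_df.copy()

# Define: keep only the text between the anchor tags in `source`.
# Code:
clean_df["source"] = clean_df["source"].str.extract(r">([^<]+)<", expand=False)
# Test:
assert not clean_df["source"].str.contains("<a ").any()

# Define: convert `timestamp` from string to datetime.
# Code:
clean_df["timestamp"] = pd.to_datetime(clean_df["timestamp"], utc=True)
# Test:
assert pd.api.types.is_datetime64_any_dtype(clean_df["timestamp"])

# Define: replace the string 'None' in `name` with a real null value.
# Code:
clean_df["name"] = clean_df["name"].replace("None", np.nan)
# Test:
assert (clean_df["name"] == "None").sum() == 0

print(clean_df)
```

Keeping each test directly under its code step makes it easy to see which fix failed if the data changes.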