WeRate-Dogs-Twitter-Data

I wrangled and analyzed the WeRateDogs (@dog_rates) Twitter data

I gathered, assessed, cleaned, and visualized three datasets: the twitter_archive dataset, the image_predictions dataset, and the JSON data obtained by querying the Twitter API. My data wrangling process began with gathering all three datasets to be used in the project.

I downloaded the WeRateDogs Twitter archive data directly from the classroom and read it into a pandas DataFrame. I downloaded the image predictions dataset from the URL provided, using the requests and os libraries.

I queried each tweet's retweet count and favourite count using the Tweepy library and stored the data in tweet_json.txt. Thereafter, I read the tweet_json.txt line by line into a pandas DataFrame with tweet_id, favourite count, and retweet count.
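
A minimal sketch of the line-by-line read described above, assuming each line of tweet_json.txt holds one tweet's JSON with the API fields `id`, `retweet_count`, and `favorite_count`. The helper name `read_tweet_counts` and the two-line sample are hypothetical stand-ins, not the notebook's actual code.

```python
import io
import json


def read_tweet_counts(lines):
    """Parse one JSON object per line and keep only the three fields of interest."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return rows


# Hypothetical two-line sample standing in for tweet_json.txt
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)
rows = read_tweet_counts(sample)
```

With the real file, the same helper can be given an open file handle, and the resulting list of dictionaries can be passed to `pd.DataFrame(rows)`.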

I assessed the data both visually and programmatically, and observed the following issues:

Quality Issues in The Twitter Archive Dataset

Some dog names are not actual names
The full HTML link in the source column should be replaced with the actual source
Remove retweets by dropping rows with values in the retweeted_status_id column
Drop the following columns: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, text
After splitting the timestamp column, drop the timestamp column
Drop the pupper, doggo, puppo, and floofer columns after merging them into one "dog_stage" column
Drop the expanded_urls column
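
Two of these steps, removing retweets and replacing the HTML link with the source name, can be sketched as follows. This assumes the source column holds a full `<a href=...>` anchor tag. The three-row sample and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical three-row sample using the archive's column names
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    ],
    "retweeted_status_id": [None, None, 123.0],
})

# Keep only original tweets: retweets have a value in retweeted_status_id
clean = archive[archive["retweeted_status_id"].isna()].copy()

# Replace the full HTML anchor with the text between the tags
clean["source"] = clean["source"].str.extract(r">([^<]+)</a>", expand=False)

# Drop the retweet column once it is no longer needed
clean = clean.drop(columns=["retweeted_status_id"])
```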

Tidiness Issues in The Twitter Archive Dataset

The timestamp column should be split into date and time columns, and dtype of the date column should be changed to datetime
doggo, floofer, pupper, and puppo columns should be merged into one (dog_stage)
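
A minimal sketch of both tidiness fixes, assuming the four stage columns hold either the stage name or the string "None", and that timestamps look like `2017-08-01 16:23:56 +0000`. The sample rows are hypothetical, and joining multiple stages with a comma is one possible choice rather than the notebook's confirmed behavior.

```python
import pandas as pd

# Hypothetical sample using the archive's column names
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-07-31 00:18:03 +0000",
        "2017-07-30 15:58:51 +0000",
    ],
    "doggo": ["None", "doggo", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})

# Split timestamp into a datetime-typed date column and a time column
ts = pd.to_datetime(archive["timestamp"], utc=True)
archive["date"] = pd.to_datetime(ts.dt.date)
archive["time"] = ts.dt.time

# Merge the four stage columns into one dog_stage column
stage_cols = ["doggo", "floofer", "pupper", "puppo"]
stages = archive[stage_cols].replace("None", "")
joined = stages.apply(lambda row: ", ".join(s for s in row if s), axis=1)
archive["dog_stage"] = joined.mask(joined == "")  # rows with no stage become NaN

# Drop the columns that have been replaced
archive = archive.drop(columns=stage_cols + ["timestamp"])
```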

Quality Issues in The Image Predictions Dataset

Some predicted breeds are not actual dog breeds. Remove predictions whose p1_dog, p2_dog, or p3_dog value is False, as they are not dogs of any breed

Tidiness Issues in The Image Predictions Dataset

For each row, keep the maximum p_conf value and the corresponding p and p_dog values
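
One way to sketch this step, assuming the predictions table has the columns p1, p1_conf, p1_dog through p3, p3_conf, p3_dog. The helper `top_prediction`, the output column names p, p_conf, p_dog, and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample using the image predictions column layout
preds = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1": ["golden_retriever", "paper_towel", "seat_belt"],
    "p1_conf": [0.80, 0.60, 0.50],
    "p1_dog": [True, False, False],
    "p2": ["Labrador_retriever", "Chihuahua", "sunglasses"],
    "p2_conf": [0.15, 0.30, 0.30],
    "p2_dog": [True, True, False],
    "p3": ["kuvasz", "pug", "toaster"],
    "p3_conf": [0.05, 0.10, 0.20],
    "p3_dog": [True, True, False],
})


def top_prediction(row):
    """Return the prediction with the highest confidence for one row."""
    best = max((1, 2, 3), key=lambda i: row[f"p{i}_conf"])
    return pd.Series({
        "p": row[f"p{best}"],
        "p_conf": row[f"p{best}_conf"],
        "p_dog": row[f"p{best}_dog"],
    })


top = pd.concat([preds[["tweet_id"]], preds.apply(top_prediction, axis=1)], axis=1)

# Drop rows whose top prediction is not a dog breed
dog_predictions = top[top["p_dog"].astype(bool)].reset_index(drop=True)
```

In this sample only the first tweet survives, because the most confident prediction for the other two is not a dog.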

Additional Quality Issues

Change the data type of the tweet_id column in all three datasets to string before merging

Additional Tidiness Issues

Merge all three datasets
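
The type conversion and merge can be sketched as below. The inner join is an assumption (the README does not state the join type), and the three small frames and their values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the three cleaned datasets
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "name": ["Rex", "Bella", "Milo"]})
predictions = pd.DataFrame({"tweet_id": [1, 2], "p": ["pug", "Chihuahua"]})
counts = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [5, 90, 2],
    "favorite_count": [20, 300, 9],
})

# tweet_id is an identifier, not a quantity, so store it as a string everywhere
for df in (archive, predictions, counts):
    df["tweet_id"] = df["tweet_id"].astype(str)

# Assumption: keep only tweets present in all three datasets (inner join)
master = (
    archive
    .merge(predictions, on="tweet_id", how="inner")
    .merge(counts, on="tweet_id", how="inner")
)
```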

All issues observed were addressed and cleaned. After cleaning, all three datasets were merged. I drew the following insights:

Insights:

The most common source of tweets was Twitter for iPhone, and the least common source was TweetDeck
The most common dog name is Cooper
A dog named 'Stephan' of the Chihuahua breed had the highest number of likes and retweets

The following visualizations were made

distribution of the 20 most common dog breeds
top 20 most common dog names
distribution of the 20 least common dog breeds
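
The counts behind charts and insights like these can be sketched with `value_counts` and `idxmax`. The column name `breed` and the sample rows are hypothetical, and the numbers are not the project's results.

```python
import pandas as pd

# Hypothetical sample of the merged dataset
master = pd.DataFrame({
    "breed": ["pug", "pug", "Chihuahua", "pug", "Chihuahua", "kuvasz"],
    "favorite_count": [10, 40, 300, 5, 20, 8],
})

# value_counts sorts from most to least common
breed_counts = master["breed"].value_counts()
top_breeds = breed_counts.head(20)     # the 20 most common breeds
least_breeds = breed_counts.tail(20)   # the 20 least common breeds

# Row with the highest number of likes
most_liked = master.loc[master["favorite_count"].idxmax()]
```

Each of the two count Series can then be drawn as a bar chart, for example with `top_breeds.plot(kind="bar")` when matplotlib is installed.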

After assessing my data, I made a copy of each of my three datasets before cleaning. I successfully cleaned all issues identified during assessment. After cleaning, I saved the gathered and combined dataset into master_twitter_archive.csv.

Thereafter, I generated some insights and made some visualizations.



DewunmiGiwa/WeRate-Dogs-Twitter-Data
