Webscraping and exploratory data analysis of Netnaija Uploads.

Sachimugu/Net_Naija

Netnaija Exploratory Data Analysis

Introduction

Netnaija is a free media download center created in March 2016. This project scrapes data from the site's movies section, covering movies uploaded from 2016 to date. Assuming the role of a data analyst, I cleaned the scraped data and analyzed it for insights. The main motivation behind this project was the lack of dirty data on which to practise my data cleaning skills; since Netnaija is one of my favourite movie sites, I decided to scrape and analyse its data.

Data Scraping

For the analysis, the following fields were scraped for each movie: title, link, type, time of upload, length, number of comments, summary, genre, release date, stars, languages, subtitles, and IMDb link. Not all of these features were present for every movie; where one was absent, I replaced it with "missing". The scraping was done using the Beautiful Soup and Requests libraries. The dataset is in Dataset/netnaija_movie.csv and the scraping script is in Web_Scraping/scraping_script.py.
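As a rough illustration of the "replace absent features with missing" idea, here is a minimal sketch using Beautiful Soup. The HTML snippet, the tag and class names, the `FIELDS` list, and the `parse_movie` helper are all hypothetical; the real Netnaija markup and the actual logic in scraping_script.py may differ. Fetching the page with Requests is left out so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: invented for illustration, not copied from Netnaija.
SAMPLE_HTML = """
<article>
  <h1 class="title">Example Movie (2020)</h1>
  <ul class="details">
    <li><b>Genre:</b> Drama, Thriller</li>
    <li><b>Stars:</b> Jane Doe, John Roe</li>
  </ul>
</article>
"""

# Assumed subset of the detail fields listed above.
FIELDS = ["Genre", "Release Date", "Stars", "Language", "Subtitles"]


def parse_movie(html: str) -> dict:
    """Return one record per movie page, using "missing" for absent fields."""
    soup = BeautifulSoup(html, "html.parser")

    title_tag = soup.find("h1", class_="title")
    record = {"title": title_tag.get_text(strip=True) if title_tag else "missing"}

    # Collect every "<b>Label:</b> value" pair that is present on the page.
    found = {}
    for item in soup.find_all("li"):
        label_tag = item.find("b")
        if label_tag is None:
            continue
        # extract() detaches the label so only the value text stays in the <li>.
        label = label_tag.extract().get_text(strip=True).rstrip(":")
        found[label] = item.get_text(strip=True)

    # Any expected field that was not on the page becomes "missing".
    for field in FIELDS:
        record[field] = found.get(field, "missing")
    return record


record = parse_movie(SAMPLE_HTML)
print(record)
```

Filling gaps with a fixed placeholder at scrape time keeps every row the same shape, which makes it easy to convert the placeholder to NaN later during cleaning.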

Run Script

To get updated data from the site, create a folder, open a terminal in that folder, and run the following commands one at a time.

# Clone the project
git clone https://github.com/Sachimugu/Net_Naija.git
# Enter the script directory
cd Net_Naija/Web_Scraping/
# Create a conda virtual environment called webscraping and install the packages
conda create --name webscraping pandas requests beautifulsoup4 lxml
# Activate the conda environment
conda activate webscraping
# Install the tqdm package
pip install tqdm
# Run the script
python scraping_script.py

Data Cleaning

After scraping, the pandas package was used to clean the data: duplicate rows were dropped, and parentheses and semicolons present in the scraped text were removed. Columns that were not significant to the analysis were also dropped. Regex was used to parse information from the movie title and add it to the upload date. Each column was cast to the appropriate data type. NaN values were removed, replaced, or left alone depending on the column, and the upload date was set as the final index. After this process, 82% of the rows and 11 columns were left for the analysis.
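The cleaning steps can be sketched with pandas roughly as follows. The rows and column names below are invented for illustration and do not come from netnaija_movie.csv. The README does not say exactly what is parsed from the title, so extracting a four-digit year is an assumption; the notebook's actual code may differ.

```python
import pandas as pd

# Invented rows standing in for the scraped CSV; column names are assumptions.
raw = pd.DataFrame(
    {
        "title": ["Example Movie (2020)", "Example Movie (2020)", "Another Film (2018);"],
        "upload_date": ["2021-01-05", "2021-01-05", "2019-03-12"],
        "comments": ["12", "12", "missing"],
    }
)

# 1. Drop exact duplicate rows.
clean = raw.drop_duplicates().copy()

# 2. Use a regex to pull a four-digit year out of the title (assumed target).
year = clean["title"].str.extract(r"\((\d{4})\)", expand=False)
clean["release_year"] = pd.to_numeric(year, errors="coerce").astype("Int64")

# 3. Remove parentheses and semicolons left over from scraping.
clean["title"] = clean["title"].str.replace(r"[();]", "", regex=True).str.strip()

# 4. Cast columns to appropriate types; "missing" becomes NaN.
clean["comments"] = pd.to_numeric(clean["comments"], errors="coerce")
clean["upload_date"] = pd.to_datetime(clean["upload_date"])

# 5. Use the upload date as the index.
clean = clean.set_index("upload_date").sort_index()

print(clean)
```

Passing `errors="coerce"` turns the "missing" placeholder into NaN in one step, after which each column can decide separately whether to drop, fill, or keep those values.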

Glance at Results

The three stars with the most movies are:

Nicolas Cage, Bruce Willis, Samuel L. Jackson

Top four genres, with all remaining genres grouped as Other (pie chart):

Drama: 20.25%, Comedy: 11.70%, Thriller: 11.27%, Action: 10.99%, Other: 45.78%
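A breakdown like this can be computed by counting labels, keeping the top four, and folding the rest into one bucket. The sketch below uses only the standard library rather than pandas, and the `top_n_with_other` helper and the genre list are invented for illustration; it also assumes one genre label per entry, which may not match how the notebook handles movies with several genres.

```python
from collections import Counter


def top_n_with_other(labels, n=4):
    """Percentage share of the n most common labels, plus an "Other" bucket."""
    counts = Counter(labels)
    total = sum(counts.values())
    top = counts.most_common(n)

    shares = {name: 100 * count / total for name, count in top}
    remainder = total - sum(count for _, count in top)
    if remainder:
        shares["Other"] = 100 * remainder / total
    return shares


# Invented sample: 20 genre labels, not taken from the dataset.
genres = (
    ["Drama"] * 6 + ["Comedy"] * 4 + ["Thriller"] * 3
    + ["Action"] * 3 + ["Horror"] * 2 + ["Romance"] * 2
)

for name, share in top_n_with_other(genres).items():
    print(f"{name}: {share:.2f}%")
```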

Upload trend: there seems to be a surge in the number of uploads in the first quarter of each year.
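One way to check a seasonal pattern like this is to count uploads per year and quarter from the date index. This is a minimal sketch on invented dates with assumed column names, not the notebook's actual code.

```python
import pandas as pd

# Invented upload dates; the real data is indexed by upload date after cleaning.
uploads = pd.DataFrame(
    {"title": ["A", "B", "C", "D", "E"]},
    index=pd.to_datetime(
        ["2020-01-15", "2020-02-03", "2020-07-20", "2021-03-01", "2021-11-11"]
    ),
)

# Derive year and quarter from the datetime index, then count rows per group.
uploads["year"] = uploads.index.year
uploads["quarter"] = uploads.index.quarter
counts = uploads.groupby(["year", "quarter"]).size()

print(counts)
```

Comparing the quarter 1 counts against the other quarters within each year shows whether the surge repeats annually.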

Explore the notebook in Notebook/Data_Wrangling_and_EDA.ipynb and the full report in Report/netnaija.pdf.

Repository structure

├── assets
│   ├── banner.png                                      <----- Readme banner
│   ├── nicolas.jpeg                                    <----- Readme Nicolas Cage photo
│   ├── pie.png                                         <----- Readme Pie chart
│   ├── samuel.png                                      <----- Readme Samuel L. Jackson photo
│   ├── script.png                                      <----- Readme Scraping photo
│   ├── Trend.png                                       <----- Readme Trend photo
│   └── willis.jpeg                                     <----- Readme Bruce Willis photo
├── Dataset
│   └── netnaija_movie.csv                              <----- Dataset
├── Notebook
│   ├── Data_Wrangling_and_EDA.html                     <----- Notebook code in html
│   └── Data_Wrangling_and_EDA.ipynb                    <----- Notebook code
├── Readme.md                                           <----- GitHub Readme file
├── Report
│   └── netnaija.pdf                                    <----- Analysis Report
└── Web_Scraping
    └── scraping_script.py                              <----- Webscraping Script

Contact
