Netnaija is a free media download center created in March 2016. This project scrapes data from its movies section, covering movies uploaded from 2016 to date. Taking on the role of a data analyst, I clean the scraped data and analyze it for insights. The main motivation behind this project was the lack of dirty data on which to practise my data cleaning skills; since Netnaija is one of my favourite movie sites, I decided to scrape and analyse its data.
For the purpose of the analysis, the following were scraped for each movie: title, movie link, movie type, time of upload, movie length, number of comments, movie summary, genre, release date, stars, movie languages, movie subtitles, and IMDB link. Not all of the listed features were present for every movie, in which case I replaced them with missing. The scraping was done using the Beautiful Soup and Requests libraries. The dataset is at Dataset/netnaija_movie.csv and the scraping script is at Web_Scraping/scraping_script.py.
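The idea of filling absent features with missing can be sketched as below. This is only an illustration, not the actual scraping script: the HTML snippet, the CSS selectors, and the `parse_movie` helper are all hypothetical, since the real page structure is not described here. In a live run the HTML would come from something like `requests.get(url).text` instead of an inline string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one movie entry; the real
# Netnaija page structure is an assumption and will differ.
SAMPLE_HTML = """
<article class="movie">
  <h2 class="title"><a href="/videos/movies/1-example">Example Movie (2020)</a></h2>
  <span class="genre">Drama</span>
</article>
"""

# Hypothetical field -> CSS selector mapping for optional features.
OPTIONAL_FIELDS = {"genre": "span.genre", "language": "span.language"}


def parse_movie(html):
    """Return one record, using 'missing' for any feature not on the page."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("h2.title a")
    record = {
        "title": link.get_text(strip=True) if link else "missing",
        "link": link["href"] if link else "missing",
    }
    for name, selector in OPTIONAL_FIELDS.items():
        node = soup.select_one(selector)
        record[name] = node.get_text(strip=True) if node else "missing"
    return record


record = parse_movie(SAMPLE_HTML)
print(record)
```

Keeping the selectors in one dictionary means a newly scraped feature only needs one extra entry, and every absent feature gets the same placeholder.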
To get updated data from the site: 1. Create a folder. 2. Open a terminal in the created folder and run the following commands one by one.
# clone the project
git clone https://github.com/Sachimugu/Net_Naija.git
# enter the Script directory
cd Net_Naija/Web_Scraping/
# Create a conda virtual environment called webscraping and install the packages
conda create --name webscraping pandas requests beautifulsoup4 lxml
# Activate the conda environment
conda activate webscraping
# Install the tqdm package
pip install tqdm
# Run the script
python scraping_script.py
After getting the scraped data, the pandas package was used to clean it by dropping duplicate rows and removing the parentheses and semicolons present in the scraped data. Columns that were not significant to the analysis were also dropped. Regex was used to parse the movie year from the title and add it to the upload date. Each column was cast to the appropriate data type. NaN values were removed, replaced, or left alone depending on the column, and the final index was set to the upload date. After this process, 82% of the rows and 11 columns were left for the analysis.
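A minimal sketch of these cleaning steps is shown below. The toy rows, the column names, and the choice to pull a four-digit year out of the title are assumptions made for illustration; the notebook's actual code may differ.

```python
import pandas as pd

# Toy rows standing in for the scraped data; column names are assumptions.
raw = pd.DataFrame({
    "title": ["Example One (2019);", "Example One (2019);", "Example Two (2021)"],
    "upload_time": ["2019-03-01", "2019-03-01", "2021-01-15"],
    "comments": ["4", "4", None],
})

# Drop exact duplicate rows.
clean = raw.drop_duplicates().copy()

# Assumed regex step: extract a four-digit year in parentheses from the title.
clean["year"] = clean["title"].str.extract(r"\((\d{4})\)", expand=False).astype(int)

# Remove the year, stray parentheses, and semicolons from the title.
clean["title"] = (
    clean["title"].str.replace(r"\s*\(\d{4}\)|[();]", "", regex=True).str.strip()
)

# Cast columns to appropriate types; missing comment counts become 0 here.
clean["upload_time"] = pd.to_datetime(clean["upload_time"])
clean["comments"] = pd.to_numeric(clean["comments"]).fillna(0).astype(int)

# Use the upload date as the index.
clean = clean.set_index("upload_time")
print(clean)
```

Whether a missing value is dropped, filled, or kept depends on the column; filling the comment count with 0 is just one possible choice.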
The top three stars by number of movies are:
Nicolas Cage | Bruce Willis | Samuel L. Jackson
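One way such a ranking could be computed with pandas is sketched here. It assumes the stars are stored as a comma-separated string in a column named `stars`, which is a guess about the dataset layout; the actor names are placeholders, not real data.

```python
import pandas as pd

# Placeholder data; the 'stars' column name and its comma-separated
# format are assumptions about the dataset.
movies = pd.DataFrame({
    "stars": ["Actor A, Actor B", "Actor A, Actor C", "Actor B, Actor A", None],
})

star_counts = (
    movies["stars"]
    .dropna()            # skip movies with no listed stars
    .str.split(",")      # one list of names per movie
    .explode()           # one row per (movie, star) pair
    .str.strip()         # remove spaces left over from the split
    .value_counts()      # number of movies per star, sorted descending
)

top_three = star_counts.head(3)
print(top_three)
```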
Top four genres, with all other genres grouped as Other, shown in a pie chart:
Drama: 20.25%, Comedy: 11.70%, Thriller: 11.27%, Action: 10.99%, Other: 45.78%
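The grouping of everything outside the top four into Other can be sketched as follows. The genre values below are toy data chosen so the arithmetic is easy to follow; they are not the project's figures.

```python
import pandas as pd

# Toy genre column: 16 movies in total.
genres = pd.Series(
    ["Drama"] * 5 + ["Comedy"] * 4 + ["Thriller"] * 3 + ["Action"] * 2
    + ["Horror", "Romance"]
)

# Percentage share of each genre, sorted from most to least common.
share = genres.value_counts(normalize=True) * 100

# Keep the four largest genres and sum the rest into a single slice.
top_four = share.head(4)
other = share.iloc[4:].sum()
pie_data = pd.concat([top_four, pd.Series({"Other": other})])
print(pie_data.round(2))
```

With matplotlib installed, `pie_data.plot.pie()` would draw the chart from this series.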
Upload Trend: There seems to be a surge in the number of uploads in the first quarter of each year.
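A quarterly trend like this can be checked by counting uploads per year and quarter. The sketch below assumes the upload date is the DataFrame index, as described in the cleaning step, and uses made-up dates.

```python
import pandas as pd

# Made-up upload dates used as the index; titles are placeholders.
uploads = pd.DataFrame(
    {"title": ["a", "b", "c", "d"]},
    index=pd.to_datetime(["2020-01-10", "2020-02-20", "2020-07-05", "2021-03-01"]),
)

# Count the rows that fall in each (year, quarter) pair.
per_quarter = uploads.groupby(
    [uploads.index.year, uploads.index.quarter]
).size()
per_quarter.index.names = ["year", "quarter"]
print(per_quarter)
```

Comparing quarter 1 with the other quarters of the same year in this table is what would confirm or contradict the surge.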
Explore the notebook file at Notebook/Data_Wrangling_and_EDA.ipynb and the full report at Report/netnaija.pdf.
├── assets
│ ├── banner.png <----- Readme banner
│ ├── nicolas.jpeg <----- Readme Nicolas Cage photo
│ ├── pie.png <----- Readme Pie chart
│ ├── samuel.png <----- Readme Samuel L. Jackson photo
│ ├── script.png <----- Readme Scraping photo
│ ├── Trend.png <----- Readme Trend photo
│ └── willis.jpeg <----- Readme Bruce Willis photo
├── Dataset
│ └── netnaija_movie.csv <----- Dataset
├── Notebook
│ ├── Data_Wrangling_and_EDA.html <----- Notebook code in html
│ └── Data_Wrangling_and_EDA.ipynb <----- Notebook code
├── Readme.md <----- GitHub Readme file
├── Report
│ └── netnaija.pdf <----- Analysis Report
└── Web_Scraping
  └── scraping_script.py <----- Webscraping Script