This repo shows how to scrape data from a website using BeautifulSoup and then apply Exploratory Data Analysis (EDA) to the extracted data to find useful insights.
Web scraping (also termed screen scraping, web data extraction, web harvesting, etc.) is a technique for extracting large amounts of data from websites, where the data is extracted and saved to a local file on your computer or to a database in tabular (spreadsheet) format.
Here, the work is broadly divided into two steps:
- Scraping data from the website
- Exploratory data analysis
We are going to scrape the data using a Python library called BeautifulSoup, which is mainly used for parsing HTML/XML files. For our purpose, we are going to extract data on the top 1000 movies from the IMDB website.
Each page lists only 50 movies, so we iterate over a URL parameter in steps of 50 to move through the result pages.
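As a sketch, the pagination step can be done by building the page URLs up front. Note the exact URL pattern and the `start` parameter are assumptions about IMDB's search layout, not taken from this repo:

```python
# Sketch: build the list of result-page URLs for the top 1000 movies.
# The URL pattern and the "start" parameter are assumptions about the site.
base_url = "https://www.imdb.com/search/title/?groups=top_1000&start={}"

# Each results page lists 50 movies, so step the start index by 50.
page_urls = [base_url.format(start) for start in range(1, 1000, 50)]

print(len(page_urls))   # 20 pages cover 1000 movies
print(page_urls[0])
```

Each URL in `page_urls` would then be fetched (e.g. with `requests.get`) and handed to BeautifulSoup for parsing.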
Along the way, we will understand: how containers are used for parsing, how data formatting is done and why it is needed, and why data wrangling is needed.
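To illustrate container-based parsing, here is a minimal, self-contained sketch; the tag and class names below are hypothetical stand-ins, not IMDB's actual markup:

```python
from bs4 import BeautifulSoup

# A minimal stand-in for one movie "container" on a results page;
# the tag and class names are hypothetical, not IMDB's real markup.
html = """
<div class="lister-item">
  <h3><a>The Shawshank Redemption</a>
    <span class="lister-item-year">(1994)</span></h3>
  <div class="ratings-bar"><strong>9.3</strong></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for container in soup.find_all("div", class_="lister-item"):
    title = container.h3.a.text
    year = container.find("span", class_="lister-item-year").text  # raw: "(1994)"
    rating = float(container.strong.text)
    print(title, year, rating)
```

Notice the year comes back as the raw string "(1994)": this is exactly why a data-formatting step (stripping parentheses, converting to numbers) is needed before analysis.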
After parsing, data formatting, and data wrangling, our final dataset will look like this:

We are going to perform EDA on the dataset with the following steps:
1. Find data inconsistencies/missing values
2. Resolve those missing values
3. Plot graphs based on the resulting data (part of the statistical analysis)
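Steps 1 and 2 can be sketched with pandas as follows; the column names and values here are hypothetical examples, not the repo's actual dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the scraped dataset; a score column is often missing.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "rating": [9.3, 9.2, 9.0],
    "metascore": [82.0, np.nan, 100.0],
})

# Step 1: locate inconsistencies / missing values per column.
print(movies.isnull().sum())

# Step 2: one way to resolve them - fill with the column mean
# (dropping the affected rows is the main alternative).
movies["metascore"] = movies["metascore"].fillna(movies["metascore"].mean())
```

Filling with the mean keeps all 1000 rows, at the cost of slightly flattening the distribution of the filled column.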
We will understand how to solve the missing-values problem with different methods, and then plot graphs such as:
- Swarmplot (with seaborn)
- Pie chart
- Bar graph
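A minimal plotting sketch for the three chart types, using a small hypothetical dataset (the column names are assumptions, not the repo's actual schema):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical cleaned dataset.
movies = pd.DataFrame({
    "genre": ["Drama", "Crime", "Drama", "Action"],
    "rating": [9.3, 9.2, 8.9, 8.7],
})

# Swarmplot of ratings per genre with seaborn.
sns.swarmplot(data=movies, x="genre", y="rating")
plt.savefig("swarmplot.png")
plt.clf()

# Pie chart of how many movies fall in each genre.
movies["genre"].value_counts().plot.pie(autopct="%1.0f%%")
plt.savefig("genre_pie.png")
plt.clf()

# Bar graph of the mean rating per genre.
movies.groupby("genre")["rating"].mean().plot.bar()
plt.savefig("rating_bar.png")
```

On the real dataset the same calls apply unchanged; only the DataFrame and column names differ.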



