# **Udemy Course Scraper**
[![Python](https://img.shields.io/badge/Python-3.9%2B-blue.svg)](https://python.org) [![Selenium](https://img.shields.io/badge/Selenium-Automation-green.svg)](https://www.selenium.dev/) [![License](https://img.shields.io/badge/License-Apache-red.svg)](LICENSE)

A Python-based scraper that extracts **Udemy courses** dynamically rendered with JavaScript. It utilizes **Selenium** for scraping and processes the data into a clean, organized DataFrame with **Pandas** for further analysis.

---

## **Features**  
- Scrape Udemy courses dynamically using **Selenium**.  
- Extract key details like **Name**, **Description**, **Instructor**, **Rating**, **Price**, and more.  
- Save raw data to a `.txt` file for further processing.  
- Clean and organize data into a neat **Pandas DataFrame**.  

## **Why not use `BeautifulSoup`?**
`BeautifulSoup` is a great tool for parsing HTML and XML documents, but on its own it isn't enough for this task. Here's why:
*   `BeautifulSoup` only parses the markup it is given; it does not execute JavaScript.
*   The courses on **Udemy** are rendered dynamically using `JavaScript`, so the HTML returned by a plain HTTP request does not contain the course cards.

**Selenium** drives a real browser that runs the page's scripts, so the course cards exist in the DOM by the time we query it.
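To illustrate the point, here is a minimal sketch that uses the standard library's `html.parser` as a stand-in for a static parser such as `BeautifulSoup`. The HTML snippet and the `course-list` id are made up for illustration; they are not Udemy's real markup.

```python
from html.parser import HTMLParser

# Hypothetical page: the container stays empty until a browser runs the script.
RAW_HTML = """
<div id="course-list"></div>
<script>
  document.getElementById("course-list").innerHTML = "<h3>Some course</h3>";
</script>
"""


class HeadingCounter(HTMLParser):
    """Counts <h3> start tags that appear as real markup."""

    def __init__(self):
        super().__init__()
        self.h3_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.h3_count += 1


parser = HeadingCounter()
parser.feed(RAW_HTML)
# The only <h3> lives inside the script text, so a static parser never sees it.
print(parser.h3_count)  # 0
```

A browser would execute the script and insert the heading, which is why a browser-driving tool is needed for pages like this.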


---

## **Setup Instructions**  

### 1. Clone the Repository  
```bash  
git clone https://github.com/yourusername/udemy-course-scraper.git  
cd udemy-course-scraper  
```  

### 2. Install Dependencies  
To install the required libraries (`openpyxl` is only needed for the Excel export), run the following command:  
```bash
pip install selenium pandas numpy openpyxl
```


### 3. Run the Scraper  
Follow these steps to execute the scraper:

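1. **Set up ChromeDriver**:  
   - Download the appropriate [ChromeDriver](https://chromedriver.chromium.org/downloads) version for your Chrome browser and make sure it is on your `PATH`. With Selenium 4.6 or later this step is usually optional, because Selenium Manager fetches a matching driver automatically.  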

2. **Run the Script**:  
   Execute the Python script:  
   ```bash  
   python scraper.py  
   ```

3. **Data Output**:  
   The raw scraped data will be saved in a file named `data.txt`.  

---

## **Code**

### **Scraping the Data**  


First, let's import `Selenium`:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
```

Now, let's set up the WebDriver:

```python
driver = webdriver.Chrome()
# To scrape the first page only
driver.get("https://www.udemy.com/courses/development/data-science/")

# Retry element lookups for up to 10 seconds before giving up
driver.implicitly_wait(10)
```

If we compare the markup shown in the *Elements* tab of the browser's developer tools with what Selenium returns, we'll notice that the class names on the course elements differ.

To work around this problem, we'll first retrieve the parent container, then retrieve its children.

```python
# Locate the course list container and extract child elements
parent = driver.find_element(By.CLASS_NAME, 'course-list_container__yXli8')
children = parent.find_elements(By.CLASS_NAME, 'course-list_card-layout-container__F2SfZ')
```
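Class names such as `course-list_container__yXli8` look like build-generated names whose suffix (`__yXli8`) may change when the site is redeployed. That is an assumption based on the naming pattern, not something Udemy documents. If the lookups above stop matching, one possible workaround is to match on the stable prefix with a CSS attribute selector. The helper name `class_prefix_selector` is hypothetical.

```python
def class_prefix_selector(prefix: str) -> str:
    """Build a CSS selector matching elements whose class attribute contains `prefix`."""
    return f'[class*="{prefix}"]'


# Prefixes taken from the class names used in this README, without the hashed suffix.
container_selector = class_prefix_selector("course-list_container__")
card_selector = class_prefix_selector("course-list_card-layout-container__")

print(container_selector)
print(card_selector)
```

These strings could then be passed to Selenium with `By.CSS_SELECTOR`, for example `driver.find_element(By.CSS_SELECTOR, container_selector)`.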

Finally, save the data we got in an external `txt` file:

```python
# Save the extracted data into a text file
with open("data.txt", "w", encoding="utf-8") as file:
    for child in children:
        element = child.find_element(By.CLASS_NAME, "popper_popper__jZgEv").text
        file.write(element + "\n================================\n")

# Close the browser once the data is saved
driver.quit()
```

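- **`implicitly_wait(10)`**: Tells Selenium to keep retrying an element lookup for up to 10 seconds before raising an error; it does not by itself guarantee that every element has loaded.  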
- **Find Elements**: Identifies the container of courses and iterates over child elements to extract data.  
- **Save Data**: Writes the raw scraped text to `data.txt`.  

---

### **Cleaning and Organizing Data**   

```python
import pandas as pd
import numpy as np
```

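The records in `data.txt` are separated by a delimiter line, so we'll first need to split the file into separate `list` elements, one per course:

```python
# Read with the same encoding used when writing, so characters like "£" survive
with open("data.txt", encoding="utf-8") as file:
    courses_list = file.read().split("================================")
```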
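The splitting step can be tried without running the scraper. The sketch below uses a small made-up sample (not real scraped data) in the same format the scraper writes, and filters out empty pieces while splitting. The names `DELIMITER`, `sample`, and `split_courses` are illustrative.

```python
DELIMITER = "================================"

# Made-up sample in the format written by the scraper (not real data).
sample = (
    "Course A\nDescription A\nBestseller\n" + DELIMITER + "\n"
    "Course B\nDescription B\n" + DELIMITER + "\n"
)


def split_courses(raw: str) -> list[list[str]]:
    """Split raw text into one list of non-empty lines per course."""
    courses = []
    for chunk in raw.split(DELIMITER):
        lines = [line for line in chunk.split("\n") if line]
        if lines:  # skip the empty chunk after the final delimiter
            courses.append(lines)
    return courses


print(split_courses(sample))
# [['Course A', 'Description A', 'Bestseller'], ['Course B', 'Description B']]
```

Because this variant already removes empty strings, the later `del courses_list[-1]` and `del course[-1]` steps would no longer be needed if you adopted it; leaving them in would delete real data.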

Next, split each course's text into its individual lines (title, description, and so on):

```python
courses_list = [x.split("\n") for x in courses_list]
# The text after the final delimiter is just a newline, so drop that last entry
del courses_list[-1]

courses_list
```

```text
[['Machine Learning A-Z: AI, Python & R + ChatGPT Prize [2024]',
  'Learn to create Machine Learning Algorithms in Python and R from two Data Science experts. Code templates included.',
  'Learn to create Machine Learning Algorithms in Python and R from two Data Science experts. Code templates included.',
  'Instructors:',
  'Kirill Eremenko, Hadelin de Ponteves, SuperDataScience Team, Ligency Team',
  'Rating: 4.5 out of 5',
  '4.5',
  '(191,570)',
  '43 total hours',
  '387 lectures',
  'All Levels',
  'Current price',
  'E£249.99',
  'Original Price',
  'E£2,099.99',
  'Bestseller',
  ''],
 ['',
  'Python for Data Science and Machine Learning Bootcamp',
  'Learn how to use NumPy, Pandas, Seaborn , Matplotlib , Plotly , Scikit-Learn , Machine Learning, Tensorflow , and more!',
  'Learn how to use NumPy, Pandas, Seaborn , Matplotlib , Plotly , Scikit-Learn , Machine Learning, Tensorflow , and more!',
  'Instructors:',
  'Jose Portilla, Pierian Training',
  'Rating: 4.6 out of 5',
  ...
```

As we can see, the data isn't ready to be used yet: there are duplicated descriptions, label lines, and empty strings. We'll have to clean this up:

```python
data = []
for course in courses_list:
    element = []

    # The last element is an empty string left over from the trailing '\n'
    del course[-1]

    # Turn the optional "Bestseller" badge into a Yes/NaN flag
    if course[-1] == "Bestseller":
        course[-1] = "Yes"
    else:
        course.append(np.nan)

    for piece in course:
        # Skip label lines, empty strings, and the verbose "Rating: x out of 5" line
        if isinstance(piece, str) and (piece in ("Instructor:", "Instructors:", "Current Price", "Original Price", "Current price", "Original price", "") or "Rating:" in piece):
            continue

        element.append(piece)
    # element[1] duplicates the description in element[2], so skip it
    data.append([element[0], *element[2:]])
```
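The same cleaning logic can be sketched as a small function that does not modify its input, which makes it easier to test on a single record. The names `LABELS` and `clean_course` and the sample record are made up for illustration, and `math.nan` is used in place of `np.nan` only to keep the example dependency-free.

```python
import math

# Label lines in the scraped text that carry no information of their own.
LABELS = {"Instructor:", "Instructors:", "Current price", "Current Price",
          "Original price", "Original Price", ""}


def clean_course(lines):
    """Return one row: name, description, ..., bestseller flag ("Yes" or NaN)."""
    kept = [p for p in lines if p not in LABELS and "Rating:" not in p]
    if kept and kept[-1] == "Bestseller":
        kept[-1] = "Yes"
    else:
        kept.append(math.nan)
    # kept[1] and kept[2] are the duplicated description; keep only one copy.
    return [kept[0], *kept[2:]]


# Made-up record in the same shape as the scraped data.
raw = ["", "Course A", "Desc", "Desc", "Instructors:", "Jane Doe",
       "Rating: 4.5 out of 5", "4.5", "(10)", "1 total hours", "2 lectures",
       "All Levels", "Current price", "$1", "Original Price", "$2",
       "Bestseller", ""]
row = clean_course(raw)
print(len(row), row[-1])  # 11 Yes
```

With this helper, the loop above would reduce to `data = [clean_course(c) for c in courses_list]`.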

Let's create a `DataFrame` to store our new data:

```python
columns = ["Name", "Description", "Instructor", "Rating", "Number of Ratings",
           "Total Hours", "Number of Lectures", "Level", "Current Price",
           "Original Price", "Bestseller"]
courses = pd.DataFrame(data, columns=columns)
courses.head()
```

|   | Name | Description | Instructor | Rating | Number of Ratings | Total Hours | Number of Lectures | Level | Current Price | Original Price | Bestseller |
|---|------|-------------|------------|--------|-------------------|-------------|--------------------|-------|---------------|----------------|------------|
| 0 | Machine Learning A-Z: AI, Python & R + ChatGPT... | Learn to create Machine Learning Algorithms in... | Kirill Eremenko, Hadelin de Ponteves, SuperDat... | 4.5 | (191,570) | 43 total hours | 387 lectures | All Levels | E£249.99 | E£2,099.99 | Yes |
| 1 | Python for Data Science and Machine Learning B... | Learn how to use NumPy, Pandas, Seaborn , Matp... | Jose Portilla, Pierian Training | 4.6 | (147,245) | 25 total hours | 165 lectures | All Levels | E£349.99 | E£2,799.99 | NaN |
| 2 | The Data Science Course: Complete Data Science... | Complete Data Science Training: Math, Statisti... | 365 Careers | 4.6 | (146,979) | 31.5 total hours | 520 lectures | All Levels | E£249.99 | E£1,899.99 | Yes |
| 3 | R Programming A-Z™: R For Data Science With ... | Learn Programming In R And R Studio. Data Anal... | Kirill Eremenko, SuperDataScience Team, Ligenc... | 4.6 | (54,982) | 10.5 total hours | 80 lectures | All Levels | E£349.99 | E£2,999.99 | Yes |
| 4 | Deep Learning A-Z 2024: Neural Networks, AI & ... | Learn to create Deep Learning models in Python... | Kirill Eremenko, Hadelin de Ponteves, SuperDat... | 4.6 | (47,041) | 22.5 total hours | 189 lectures | All Levels | E£349.99 | E£2,599.99 | Yes |


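**Further cleaning is necessary before using the data:** columns such as `Number of Ratings`, `Total Hours`, `Number of Lectures`, and the prices are still stored as text.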
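One possible direction for that cleanup is a set of small parsing helpers, sketched below. They are not part of the original script; the function names are hypothetical, and they assume every value follows the formats shown in the table above (for example, a course listed as free or a duration given in minutes would need extra handling).

```python
import re


def parse_count(text: str) -> int:
    """Convert a ratings count such as '(191,570)' to 191570."""
    return int(re.sub(r"[^\d]", "", text))


def parse_leading_number(text: str) -> float:
    """Extract the leading number: '31.5 total hours' -> 31.5, '387 lectures' -> 387.0."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    if match is None:
        raise ValueError(f"no leading number in {text!r}")
    return float(match.group(1))


def parse_price(text: str) -> float:
    """Convert a price such as 'E£2,099.99' to 2099.99 (the currency prefix is discarded)."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no price in {text!r}")
    return float(match.group(0).replace(",", ""))


print(parse_count("(191,570)"))                  # 191570
print(parse_leading_number("31.5 total hours"))  # 31.5
print(parse_price("E£2,099.99"))                 # 2099.99
```

Each helper could be applied to a column with `Series.map`, for example `courses["Number of Ratings"] = courses["Number of Ratings"].map(parse_count)`, while `courses["Rating"].astype(float)` should be enough for the rating column.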

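Finally, export the `DataFrame` to CSV and Excel (the Excel export requires `openpyxl`):

```python
# Both methods write to disk and return None, so there is nothing to assign
courses.to_csv("output.csv", index=False)
courses.to_excel("output.xlsx", index=False)
```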