# **Udemy Course Scraper**
[![Python](https://img.shields.io/badge/Python-3.9%2B-blue.svg)](https://python.org) [![Selenium](https://img.shields.io/badge/Selenium-Automation-green.svg)](https://www.selenium.dev/) [![License](https://img.shields.io/badge/License-Apache-red.svg)](LICENSE)

A Python-based scraper that extracts **Udemy course** listings, which are rendered dynamically with JavaScript. It uses **Selenium** to load the pages and **Pandas** to organize the scraped data into a clean DataFrame for further analysis.

---

## **Features**  
- Scrape Udemy courses dynamically using **Selenium**.  
- Extract key details like **Name**, **Description**, **Instructor**, **Rating**, **Price**, and more.  
- Save raw data to a `.txt` file for further processing.  
- Clean and organize data into a neat **Pandas DataFrame**.  

## **Why not use `BeautifulSoup`?**
`BeautifulSoup` is a great tool for parsing HTML and XML documents, but it is not enough on its own for this task. Here's why:
*   `BeautifulSoup` only parses the markup it is given; it does not execute JavaScript.
*   A plain HTTP request to a Udemy listing page returns the initial HTML, before the course cards have been rendered.

The courses on **Udemy** are rendered dynamically with JavaScript, so a real browser driven by **Selenium** is needed to load the page before any data can be extracted.
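
To make the point concrete, here is a minimal sketch that uses only the standard library's `html.parser`. The HTML strings are made-up stand-ins (an empty container plus a script tag, and the same container after rendering), not Udemy's actual markup, and the `course-card` class name is hypothetical. Any static parser, `BeautifulSoup` included, sees only what is already in the markup it receives.

```python
from html.parser import HTMLParser

# Hypothetical initial HTML: the course container is empty until JavaScript runs.
STATIC_HTML = """
<html><body>
  <div id="course-list"></div>
  <script src="app.js"></script>
</body></html>
"""

# Hypothetical markup after a browser has executed the script.
RENDERED_HTML = '<div id="course-list"><div class="course-card">A</div></div>'


class CardCounter(HTMLParser):
    """Counts start tags whose class list includes 'course-card'."""

    def __init__(self):
        super().__init__()
        self.cards = 0

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if "course-card" in classes.split():
            self.cards += 1


def count_cards(html: str) -> int:
    parser = CardCounter()
    parser.feed(html)
    return parser.cards


static_count = count_cards(STATIC_HTML)
rendered_count = count_cards(RENDERED_HTML)
print(static_count, rendered_count)
```

The static markup yields no cards because they do not exist until a browser runs the page's scripts, which is the step Selenium provides.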


---

## **Setup Instructions**  

### 1. Clone the Repository  
```bash  
git clone https://github.com/yourusername/udemy-course-scraper.git  
cd udemy-course-scraper  
```  

### 2. Install Dependencies  
To install the required libraries (`openpyxl` is only needed for the Excel export), run the following command:  
```bash
pip install selenium pandas numpy openpyxl
```


### 3. Run the Scraper  
Follow these steps to execute the scraper:

1. **Set up ChromeDriver**:  
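   - Download the appropriate [ChromeDriver](https://chromedriver.chromium.org/downloads) version for your Chrome browser. With Selenium 4.6 or later, Selenium Manager can usually download a matching driver automatically, so this step may not be needed.  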

2. **Run the Script**:  
   Execute the Python script:  
   ```bash  
   python scraper.py  
   ```

3. **Data Output**:  
   The raw scraped data will be saved in a file named `data.txt`.  

---

## **Code**

### **Scraping the Data**  


First, let's import `Selenium`:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
```

Now, let's set up the WebDriver:

```python
driver = webdriver.Chrome()
driver.get("https://www.udemy.com/courses/development/data-science/")

# Let element lookups poll for up to 10 seconds while the page renders
driver.implicitly_wait(10)
```

```python
# Locate the course list container and extract child elements
parent = driver.find_element(By.CLASS_NAME, 'course-list_container__yXli8')
children = parent.find_elements(By.CLASS_NAME, 'course-list_card-layout-container__F2SfZ')
```

```python
# Save the extracted data into a text file
with open("data.txt", "w", encoding="utf-8") as file:
    for child in children:
        element = child.find_element(By.CLASS_NAME, "popper_popper__jZgEv").text
        file.write(element + "\n================================\n")
```

```python
driver.quit()
```

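- **`implicitly_wait(10)`**: Makes each element lookup retry for up to 10 seconds before failing, which gives the JavaScript-rendered content time to appear.  
- **Find Elements**: Locates the course container and collects its child cards. The class names carry build-generated suffixes (for example `__yXli8`), so they may need updating if Udemy changes its front end.  
- **Save Data**: Writes each card's raw text to `data.txt`, followed by a `================================` separator line.  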
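
A wait of this kind is a polling wait: the lookup is retried until it succeeds or the timeout passes. The sketch below illustrates that idea in plain Python. `wait_until` and `fake_lookup` are hypothetical helpers written for this README, not Selenium APIs; in real code, Selenium's `WebDriverWait` together with `expected_conditions` provides the same behaviour for a specific element.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose cards only appear on the third lookup.
attempts = {"count": 0}


def fake_lookup():
    attempts["count"] += 1
    return ["card-1", "card-2"] if attempts["count"] >= 3 else []


cards = wait_until(fake_lookup, timeout=2.0, poll_interval=0.01)
print(cards)
```

With a live driver, the condition could be `lambda: driver.find_elements(By.CLASS_NAME, "course-list_card-layout-container__F2SfZ")`, which returns an empty list until the cards are present.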

---

### **Cleaning and Organizing Data**  
The following script processes and cleans the scraped data to produce a structured DataFrame:  

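```python
import pandas as pd
import numpy as np
```

```python
# Read the raw text (written as UTF-8) and split it into one chunk per course
with open("data.txt", encoding="utf-8") as file:
    courses_list = file.read().split("================================")
```

```python
# Break each chunk into lines and drop the empty chunk after the last separator
courses_list = [x.split("\n") for x in courses_list]
del courses_list[-1]
```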

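```python
data = []
for course in courses_list:
    element = []
    del course[-1]  # trailing empty string left by the final newline

    # Turn the optional "Bestseller" badge into a Yes/NaN flag
    if course[-1] == "Bestseller":
        course[-1] = "Yes"
    else:
        course.append(np.nan)

    for piece in course:
        # Skip label lines and blanks; keep only the values
        if isinstance(piece, str) and (
            piece in ("Instructor:", "Instructors:", "Current Price", "Original Price",
                      "Current price", "Original price", "")
            or "Rating:" in piece
        ):
            continue
        element.append(piece)

    # Keep the first field, skip the second, keep the rest
    data.append([element[0], *element[2:]])
```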

```python
courses = pd.DataFrame(data, columns=["Name",
                                      "Description",
                                      "Instructor",
                                      "Rating",
                                      "Number of Ratings",
                                      "Total Hours",
                                      "Number of Lectures",
                                      "Level",
                                      "Current Price",
                                      "Original Price",
                                      "Bestseller"])
courses.head()
```

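Running the cell above raised the following error, which means the cleaned rows had at most 10 fields while 11 column names were supplied:

```text
ValueError: 11 columns passed, passed data had 10 columns
```

The column list has to match the fields the page actually exposes at scrape time, so adjust it (or the filtering in the cleaning loop) before exporting.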
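
A column-count `ValueError` like the one above is easier to track down once you know how many fields each cleaned row has. The following is a minimal sketch, assuming `data` is the list of rows built in the cleaning loop. The `summarize_row_lengths` helper and the sample rows are hypothetical and only illustrate the check.

```python
from collections import Counter


def summarize_row_lengths(rows, expected):
    """Return (length -> count, indices of rows whose length != expected)."""
    lengths = Counter(len(row) for row in rows)
    mismatched = [i for i, row in enumerate(rows) if len(row) != expected]
    return lengths, mismatched


# Hypothetical cleaned rows; real rows come from the loop that fills `data`.
sample_rows = [
    ["Course A", "Intro", "Jane Doe", "4.5", "Yes"],
    ["Course B", "Basics", "John Roe", "4.2"],
    ["Course C", "Advanced", "Ann Poe", "4.8", "Yes"],
]

lengths, mismatched = summarize_row_lengths(sample_rows, expected=5)
print(dict(lengths))
print(mismatched)
```

Calling `summarize_row_lengths(data, expected=11)` on the real rows shows whether every course is missing a field or only some of them, which indicates whether to change the column list or the filtering.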

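```python
# Export to CSV; to_csv writes the file and returns None, so no assignment is needed
courses.to_csv("output.csv", index=False)
```

```python
# Export to Excel; writing .xlsx files requires the openpyxl package
courses.to_excel("output.xlsx", index=False)
```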