# 📊 Data Fetching Pipeline Overview

This notebook demonstrates **three fundamental approaches** to data collection:

1. 🗄️ **Database Queries** - Fetching structured data from MySQL databases
2. 🌐 **API Integration** - Retrieving data through RESTful API endpoints
3. 🕸️ **Web Scraping** - Extracting data directly from HTML websites

---


## 1️⃣ 🗄️ Fetching Data from Database (MySQL)

### 📦 **Pandas Library**
**Purpose:** Core Python library for data manipulation and analysis  
**Used for:** DataFrame operations, SQL integration, data cleaning  
**Import:** `import pandas as pd`


In [48]:
import pandas as pd

### 🔌 **MySQL Connection Dependencies**
**Installing Required Packages:**
- `mysql-connector-python` - Official MySQL driver for Python (imported as `mysql.connector`)
- `sqlalchemy` - SQL toolkit and ORM
- `pymysql` - Pure Python MySQL client library (an alternative driver; not installed or used in this notebook)


In [49]:
!pip install mysql-connector-python
!pip install sqlalchemy




### 📥 **Import MySQL Connector**
**Library:** `mysql.connector`  
**Purpose:** Enables Python to communicate with MySQL database servers  
**Key Functions:** Establishing connections, executing queries, fetching results


In [50]:
import mysql.connector 

### 🔗 **Establish Database Connection**
**Function:** `mysql.connector.connect()`  
**Parameters:**  
- `host` - Database server address (localhost for local development)
- `user` - MySQL username
- `password` - Authentication password  
- `database` - Target database name

**Returns:** Connection object for executing SQL queries


In [51]:
conn = mysql.connector.connect(
    host='localhost',
    user='root',
    password='',
    database='world'
)


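InterfaceError: 2003: Can't connect to MySQL server on 'localhost:3306' (10061 No connection could be made because the target machine actively refused it)

**Note:** This error means nothing is accepting connections on `localhost:3306`. Start a MySQL server that has the `world` database loaded before running the remaining cells in this section.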
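
The connection parameters listed above do not have to be hardcoded in the notebook. As a minimal sketch, the helper below builds them from environment variables; the `MYSQL_*` variable names and the `load_db_config` function are hypothetical, and the defaults simply mirror the values used in the cell above.

```python
import os


def load_db_config(env=os.environ):
    """Build keyword arguments for mysql.connector.connect().

    The MYSQL_* variable names are assumptions made for this sketch;
    the defaults mirror the values used earlier in the notebook.
    """
    return {
        "host": env.get("MYSQL_HOST", "localhost"),
        "user": env.get("MYSQL_USER", "root"),
        "password": env.get("MYSQL_PASSWORD", ""),
        "database": env.get("MYSQL_DATABASE", "world"),
    }


# A plain dict stands in for the real environment so the sketch runs anywhere
config = load_db_config({"MYSQL_USER": "analyst", "MYSQL_PASSWORD": "s3cret"})
print(config["host"], config["user"], config["database"])

# With a running server, the dict can be unpacked into the connect call:
# conn = mysql.connector.connect(**config)
```

Keeping the password out of the notebook also makes it safer to share the file.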

### 📊 **Execute SQL Query & Load Data**
**Function:** `pd.read_sql_query()`  
**Purpose:** Executes SQL query and returns results as pandas DataFrame  
**Parameters:**  
- SQL query string
- Database connection object

**Use Case:** Fetching filtered data (e.g., US cities from 'city' table)


In [None]:
# Exact match on the country code, so `=` is used rather than LIKE
pd.read_sql_query("SELECT * FROM city WHERE CountryCode = 'USA'", conn)
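
When the filter value comes from user input or a variable, it is usually safer to bind it as a parameter than to paste it into the SQL string. The sketch below illustrates the idea with Python's built-in `sqlite3` module standing in for MySQL, so it runs without a server. The table layout and the three rows are made-up sample data, and `cities_for` is a hypothetical helper.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL 'world' database (an assumption
# so the example runs offline); the rows below are made-up samples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (Name TEXT, CountryCode TEXT, Population INTEGER)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [("Springfield", "USA", 100), ("Shelbyville", "USA", 80), ("Lyon", "FRA", 50)],
)


def cities_for(connection, country_code):
    """Return (Name, Population) rows for one country using a bound parameter."""
    cursor = connection.execute(
        "SELECT Name, Population FROM city WHERE CountryCode = ? ORDER BY Name",
        (country_code,),
    )
    return cursor.fetchall()


us_cities = cities_for(conn, "USA")
print(us_cities)
```

`pd.read_sql_query()` accepts the same idea through its `params` argument. Note that the placeholder syntax depends on the driver: `sqlite3` uses `?`, while `mysql.connector` uses `%s`.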

---

✅ **Section 1 Complete:** Database Query Mastered!  
⬇️ **Next:** API Data Retrieval

---

## 2️⃣ 🌐 Fetching Data From API

### 📦 **Requests Library**
**Purpose:** HTTP library for making API calls  
**Used for:** Sending GET/POST requests to web APIs and handling responses  
**Key Methods:** `requests.get()`, `requests.post()`, `Response.json()`


### 🌐 **Making API Request**
**Process Flow:**
1. 📍 **URL** - API endpoint address
2. 🔑 **Headers** - Authentication keys (x-rapidapi-key, x-rapidapi-host)
3. ⚙️ **Query Parameters** - Search filters, pagination, sorting options
4. 📡 **Send Request** - `requests.get()` with URL, headers, and params
5. 📥 **Parse Response** - Parse the JSON body with `response.json()`, then build a DataFrame from its `data` field

**Example:** Fetching anime data from RapidAPI


In [None]:
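import os
import requests

url = "https://anime-db.p.rapidapi.com/anime"

querystring = {"page": "1", "size": "10", "search": "Fullmetal", "genres": "Fantasy,Drama", "sortBy": "ranking", "sortOrder": "asc"}

headers = {
    # Read the key from an environment variable instead of hardcoding it.
    # The variable name RAPIDAPI_KEY is an assumption, not part of the API.
    "x-rapidapi-key": os.environ["RAPIDAPI_KEY"],
    "x-rapidapi-host": "anime-db.p.rapidapi.com",
}

response = requests.get(url, headers=headers, params=querystring, timeout=30)
response.raise_for_status()  # fail early on a 4xx/5xx response

# The records sit under the 'data' key of the JSON payload
df = pd.DataFrame(response.json()['data'])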
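
To make the role of the query parameters more concrete, the sketch below uses only the standard library to show the URL that the `querystring` dictionary turns into. It is an illustration of the encoding step, not a replacement for `requests.get()`; the URL that `requests` builds from `params` should look the same.

```python
from urllib.parse import urlencode

url = "https://anime-db.p.rapidapi.com/anime"
querystring = {
    "page": "1",
    "size": "10",
    "search": "Fullmetal",
    "genres": "Fantasy,Drama",
    "sortBy": "ranking",
    "sortOrder": "asc",
}

# urlencode percent-encodes each value and joins the key=value pairs with '&'
full_url = url + "?" + urlencode(querystring)
print(full_url)
```

The comma in `Fantasy,Drama` is percent-encoded as `%2C`, which is why passing a dictionary is less error-prone than assembling the URL by hand.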

In [None]:
print(df.shape)

---

✅ **Section 2 Complete:** API Integration Done!  
⬇️ **Next:** Web Scraping Techniques

---

## 3️⃣ 🕸️ Fetching Data From Web Scraping

### 📦 **Web Scraping Libraries**
**Required Packages:**

**🐍 BeautifulSoup (bs4)**
- Library for extracting data from HTML/XML pages
- Navigates and searches the parse tree
- Usage: `from bs4 import BeautifulSoup`

**🔧 lxml**
- Fast XML and HTML parser
- Optional parser backend for BeautifulSoup (selected by passing `'lxml'`)
- Better performance for large documents

**🌐 requests**
- HTTP library to fetch webpage HTML content
- Combined with BeautifulSoup for complete scraping workflow


In [52]:
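!pip install lxml
import pandas as pd
import requests  # imported under its own name, since later cells call requests.get()
from bs4 import BeautifulSoup
import lxml  # not used directly; confirms the parser backend is installed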



### 🎭 **User-Agent Header**
**Purpose:** Identifies the client to the website; here it makes the request look like it comes from a desktop browser  
**Why needed:** Some sites reject the default `python-requests` User-Agent with "Access Denied" (403) errors  
**Usage:** Pass in `headers` parameter to `requests.get()`


In [53]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}

### 📡 **Fetch Webpage HTML**
**Function:** `requests.get()`  
**Purpose:** Fetches raw HTML data from the given URL  
**Returns:** Response object with `.text` attribute containing HTML


In [54]:
webpage = requests.get('https://www.ambitionbox.com/list-of-companies?page=1', headers=headers).text

### 🍜 **Parse HTML with BeautifulSoup**
**Function:** `BeautifulSoup(webpage, 'lxml')`  
**Purpose:** Converts the raw HTML string into a structured, searchable tree  
**Parser:** `'lxml'` tells BeautifulSoup to parse the text with the lxml HTML parser  
**Why needed:** Makes it easier to navigate and extract specific data


In [55]:
soup = BeautifulSoup(webpage , 'lxml')

### 🔍 **Find All Company Cards**
**Function:** `soup.find_all('div', class_='companyCardWrapper')`  
**Purpose:** Finds all company card containers on the page  
**Returns:** List of all matching `<div>` elements  
**Usage:** First step to extract multiple companies from single page


In [56]:
company = soup.find_all('div', class_='companyCardWrapper')
len(company)

20
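
To see what `find_all()` and `find()` return without requesting the live site, the sketch below runs them on a small hand-written HTML snippet. The markup is made up and only imitates the class names used in this notebook. The built-in `'html.parser'` backend is used instead of `'lxml'` so that only `bs4` is required.

```python
from bs4 import BeautifulSoup

# Hand-written HTML that imitates the class names used in this notebook
html = """
<div class="companyCardWrapper">
  <h2> Example Corp </h2>
  <span class="companyCardWrapper__interLinking">Banking | Mumbai</span>
</div>
<div class="companyCardWrapper">
  <h2>Sample Ltd</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
cards = soup.find_all("div", class_="companyCardWrapper")

for card in cards:
    name = card.find("h2").text.strip()
    info = card.find("span", class_="companyCardWrapper__interLinking")
    # find() returns None when nothing matches, so guard before reading .text
    info_text = info.text.strip() if info else ""
    print(name, "|", repr(info_text))
```

The second card has no `interLinking` span, which is the case the `if misc_info:` check in the extraction loop protects against.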

### 📝 **Extract Company Details**
**Process:** Loop through each company card and extract:
- **Company Name** - Using `find('h2')`
- **Rating** - From the `rating_star_container` class
- **Number of Reviews** - From the `companyCardWrapper__companyRatingCount` class
- **Company Type & Location** - From the `companyCardWrapper__interLinking` class (split by `|`)

**Data Storage:** Lists that will be converted to DataFrame


In [57]:
names = []
rating = []
No_Of_Reviews = []
ctype = []
locations = []

for i in company:
    names.append(i.find('h2').text.strip())
    rating.append(i.find('div', class_='rating_star_container').text.strip())
    No_Of_Reviews.append(i.find('span', class_='companyCardWrapper__companyRatingCount').text.strip())
    misc_info = i.find('span', class_='companyCardWrapper__interLinking')
    if misc_info:
        # Split by "|" to separate type and location
        parts = misc_info.text.split('|')

        # Company Type (first part)
        ctype.append(parts[0].strip())

        # Location (second part, if present)
        locations.append(parts[1].strip() if len(parts) > 1 else '')
    else:
        # No type/location info on this card
        ctype.append('')
        locations.append('')
    

In [58]:
names

['TCS',
 'Accenture',
 'Wipro',
 'Cognizant',
 'Capgemini',
 'HDFC Bank',
 'Infosys',
 'ICICI Bank',
 'HCLTech',
 'Tech Mahindra',
 'Genpact',
 'Teleperformance',
 'Axis Bank',
 'Jio',
 'Concentrix Corporation',
 'Amazon',
 'iEnergizer',
 'Reliance Retail',
 'LTIMindtree',
 'IBM']

In [59]:
rating

['3.3',
 '3.7',
 '3.6',
 '3.6',
 '3.7',
 '3.8',
 '3.5',
 '4.0',
 '3.4',
 '3.4',
 '3.6',
 '3.8',
 '3.6',
 '4.4',
 '3.6',
 '3.9',
 '4.6',
 '3.9',
 '3.6',
 '3.9']

In [60]:
No_Of_Reviews

['(1.1L)',
 '(71.1k)',
 '(63.4k)',
 '(59.7k)',
 '(51.6k)',
 '(50.6k)',
 '(47.2k)',
 '(45.2k)',
 '(44.5k)',
 '(42.3k)',
 '(40.8k)',
 '(36.5k)',
 '(32.3k)',
 '(32.2k)',
 '(31.5k)',
 '(30.7k)',
 '(27.1k)',
 '(26.9k)',
 '(25.8k)',
 '(25.3k)']

In [61]:
ctype

['IT Services & Consulting',
 'IT Services & Consulting',
 'IT Services & Consulting',
 'IT Services & Consulting',
 'IT Services & Consulting',
 'Banking',
 'IT Services & Consulting',
 'Banking',
 'IT Services & Consulting',
 'IT Services & Consulting',
 'IT Services & Consulting',
 'BPO',
 'Banking',
 'Telecom',
 'BPO',
 'Internet',
 'BPO',
 'Retail',
 'IT Services & Consulting',
 'IT Services & Consulting']

In [62]:
locations

['Bangalore / Bengaluru +439 other locations',
 'Bangalore / Bengaluru +255 other locations',
 'Hyderabad / Secunderabad +370 other locations',
 'Hyderabad / Secunderabad +229 other locations',
 'Bangalore / Bengaluru +184 other locations',
 'Mumbai +1821 other locations',
 'Bangalore / Bengaluru +246 other locations',
 'Mumbai +1438 other locations',
 'Chennai +226 other locations',
 'Hyderabad / Secunderabad +331 other locations',
 'Hyderabad / Secunderabad +182 other locations',
 'Mumbai +254 other locations',
 'Mumbai +1499 other locations',
 'Mumbai +1895 other locations',
 'Bangalore / Bengaluru +177 other locations',
 'Bangalore / Bengaluru +518 other locations',
 'Noida +51 other locations',
 'Mumbai +1152 other locations',
 'Bangalore / Bengaluru +146 other locations',
 'Bangalore / Bengaluru +160 other locations']

### 📊 **Create DataFrame from Lists**
**Function:** `pd.DataFrame()`  
**Purpose:** Converts extracted lists into structured tabular data  
**Columns Created:**
- `Company_Name` - Company name
- `Rating` - Rating score
- `No_Of_Reviews` - Review count
- `Company_Type` - Business category
- `Location` - Office locations


In [63]:
df = pd.DataFrame({
    'Company_Name': names,
    'Rating': rating,
    'No_Of_Reviews': No_Of_Reviews,
    'Company_Type': ctype,
    'Location': locations
})


In [64]:
df

| | Company_Name | Rating | No_Of_Reviews | Company_Type | Location |
|---|---|---|---|---|---|
| 0 | TCS | 3.3 | (1.1L) | IT Services & Consulting | Bangalore / Bengaluru +439 other locations |
| 1 | Accenture | 3.7 | (71.1k) | IT Services & Consulting | Bangalore / Bengaluru +255 other locations |
| 2 | Wipro | 3.6 | (63.4k) | IT Services & Consulting | Hyderabad / Secunderabad +370 other locations |
| 3 | Cognizant | 3.6 | (59.7k) | IT Services & Consulting | Hyderabad / Secunderabad +229 other locations |
| 4 | Capgemini | 3.7 | (51.6k) | IT Services & Consulting | Bangalore / Bengaluru +184 other locations |
| 5 | HDFC Bank | 3.8 | (50.6k) | Banking | Mumbai +1821 other locations |
| 6 | Infosys | 3.5 | (47.2k) | IT Services & Consulting | Bangalore / Bengaluru +246 other locations |
| 7 | ICICI Bank | 4.0 | (45.2k) | Banking | Mumbai +1438 other locations |
| 8 | HCLTech | 3.4 | (44.5k) | IT Services & Consulting | Chennai +226 other locations |
| 9 | Tech Mahindra | 3.4 | (42.3k) | IT Services & Consulting | Hyderabad / Secunderabad +331 other locations |
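
The `No_Of_Reviews` values above are still strings such as `(1.1L)` and `(71.1k)`, so they cannot be sorted or summed as numbers yet. As a minimal sketch, the hypothetical helper `parse_review_count` below converts them to integers, assuming `k` means thousand and `L` means lakh (100,000).

```python
def parse_review_count(text):
    """Convert strings like '(71.1k)', '(1.1L)' or '(106)' to an integer.

    Assumes 'k' = thousand and 'L' = lakh (100,000), as in the scraped values.
    """
    value = text.strip().strip("()").strip()
    multipliers = {"k": 1_000, "l": 100_000}
    suffix = value[-1:].lower()
    if suffix in multipliers:
        # round() absorbs floating-point noise such as 71099.99999999999
        return round(float(value[:-1]) * multipliers[suffix])
    return int(value)


counts = [parse_review_count(s) for s in ["(1.1L)", "(71.1k)", "(106)"]]
print(counts)  # [110000, 71100, 106]
```

On the DataFrame this could be applied with `df['No_Of_Reviews'].map(parse_review_count)`, and the `Rating` column can be converted with `df['Rating'].astype(float)`.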


---

## 🔁 **Scaling Up: Multiple Pages**

### 🔄 **Loop Through All Pages**
**Strategy:** Iterate from page 1 to page 500 to collect all company data  
**Process:**
1. Generate URL for each page
2. Fetch and parse HTML
3. Extract company data
4. Append to combined list
5. Convert to final DataFrame


In [65]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Same browser-like User-Agent for every request, defined once outside the loop
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}

frames = []  # one DataFrame per page

for j in range(1, 501):  # pages 1 to 500
    url = 'https://www.ambitionbox.com/list-of-companies?page={}'.format(j)
    webpage = requests.get(url, headers=headers).text
    soup = BeautifulSoup(webpage, 'lxml')
    company = soup.find_all('div', class_='companyCardWrapper')

    names = []
    rating = []
    No_Of_Reviews = []
    ctype = []
    locations = []

    for i in company:
        names.append(i.find('h2').text.strip())
        rating.append(i.find('div', class_='rating_star_container').text.strip())
        No_Of_Reviews.append(i.find('span', class_='companyCardWrapper__companyRatingCount').text.strip())
        misc_info = i.find('span', class_='companyCardWrapper__interLinking')
        if misc_info:
            # Split by "|" to separate type and location
            parts = misc_info.text.split('|')

            # Company Type (first part)
            ctype.append(parts[0].strip())

            # Location (second part, if present)
            locations.append(parts[1].strip() if len(parts) > 1 else '')
        else:
            ctype.append('')
            locations.append('')

    df = pd.DataFrame({
        'Company_Name': names,
        'Rating': rating,
        'No_Of_Reviews': No_Of_Reviews,
        'Company_Type': ctype,
        'Location': locations
    })
    frames.append(df)

# Combine all pages once at the end instead of growing `final` inside the loop
final = pd.concat(frames, ignore_index=True)
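
The loop above always requests a fixed number of pages with no pause between them. The sketch below shows one possible variation that stops at the first empty page and waits between requests. `scrape_pages` and `fake_fetch` are hypothetical names, and the fetcher returns made-up rows so the example runs offline; in the notebook it would wrap the `requests.get()` and BeautifulSoup steps for one page.

```python
import time


def scrape_pages(fetch_rows, first_page=1, last_page=500, delay_seconds=1.0):
    """Collect rows from consecutive pages, stopping at the first empty page.

    fetch_rows(page_number) is a caller-supplied function (hypothetical here)
    that downloads and parses one page and returns a list of row dicts.
    """
    all_rows = []
    for page in range(first_page, last_page + 1):
        rows = fetch_rows(page)
        if not rows:
            # No company cards found: assume we ran past the last page
            break
        all_rows.extend(rows)
        time.sleep(delay_seconds)  # pause between requests to go easy on the server
    return all_rows


def fake_fetch(page):
    """Stand-in fetcher with made-up data: three pages of two rows each."""
    if page > 3:
        return []
    return [{"Company_Name": f"Company {page}-{n}"} for n in range(2)]


rows = scrape_pages(fake_fetch, delay_seconds=0)
print(len(rows))  # 6
```

Because the rows are plain dictionaries, a single `pd.DataFrame(rows)` at the end builds the final table.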

                    
            
    


In [66]:
final


| | Company_Name | Rating | No_Of_Reviews | Company_Type | Location |
|---|---|---|---|---|---|
| 0 | TCS | 3.3 | (1.1L) | IT Services & Consulting | Bangalore / Bengaluru +439 other locations |
| 1 | Accenture | 3.7 | (71.1k) | IT Services & Consulting | Bangalore / Bengaluru +255 other locations |
| 2 | Wipro | 3.6 | (63.4k) | IT Services & Consulting | Hyderabad / Secunderabad +370 other locations |
| 3 | Cognizant | 3.6 | (59.7k) | IT Services & Consulting | Hyderabad / Secunderabad +229 other locations |
| 4 | Capgemini | 3.7 | (51.6k) | IT Services & Consulting | Bangalore / Bengaluru +184 other locations |
| ... | ... | ... | ... | ... | ... |
| 9995 | Eduquity Career Technologies | 3.4 | (106) | Recruitment | Bangalore / Bengaluru +14 other locations |
| 9996 | D Y Patil Hospital | 3.9 | (106) | Healthcare | Mumbai +7 other locations |
| 9997 | Gulermak-Sam India | 4.1 | (106) | Surat +9 other locations | |
| 9998 | CRY - Child Rights and You | 4.0 | (106) | Non-Profit | Mumbai +9 other locations |


In [67]:
df

| | Company_Name | Rating | No_Of_Reviews | Company_Type | Location |
|---|---|---|---|---|---|
| 0 | Humana People to People India | 4.2 | (106) | Non-Profit | New Delhi +32 other locations |
| 1 | Anonymous Content | 4.1 | (106) | Pune +19 other locations | |
| 2 | Migsun Group | 3.0 | (106) | Real Estate | Ghaziabad +7 other locations |
| 3 | Bharat Parenterals | 2.9 | (106) | Pharma | Vadodara +9 other locations |
| 4 | Kelvion | 3.8 | (106) | Industrial Machinery | Pune +12 other locations |
| 5 | QDVC | 4.8 | (106) | Engineering & Construction | Doha +6 other locations |
| 6 | Eastern Book Company | 3.2 | (106) | Printing & Publishing | Lucknow +9 other locations |
| 7 | Axiom Energy Conversion | 3.1 | (106) | Power | Hyderabad / Secunderabad +15 other locations |
| 8 | Airtel X- Labs | 3.1 | (106) | Gurgaon / Gurugram +6 other locations | |
| 9 | CIBERsites India | 3.7 | (106) | IT Services & Consulting | Bangalore / Bengaluru +5 other locations |
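
In a few rows above (for example `Anonymous Content` and `Airtel X- Labs`), the location ends up in `Company_Type` and `Location` is empty. This appears to happen when a card lists no company type, so the text has no `|` to split on. The hypothetical helper below is one way to handle that case. It relies on a heuristic taken from the scraped output, namely that location strings end in `+N other locations`.

```python
import re

# Heuristic (an assumption based on the scraped output): location strings
# end in "+N other locations", e.g. "Mumbai +1821 other locations".
LOCATION_HINT = re.compile(r"\+\d+ other locations?$")


def split_type_location(text):
    """Split 'Type | Location' text into (company_type, location)."""
    if not text:
        return "", ""
    parts = [part.strip() for part in text.split("|")]
    if len(parts) >= 2:
        return parts[0], parts[1]
    # Only one part: decide which column it belongs to
    if LOCATION_HINT.search(parts[0]):
        return "", parts[0]
    return parts[0], ""


print(split_type_location("Banking | Mumbai +1821 other locations"))
print(split_type_location("Pune +19 other locations"))
```

Inside the extraction loop, the two `append` calls could then use the returned pair, for example `company_type, location = split_type_location(misc_info.text)`.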


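### 💾 **Save Data to CSV**
**Function:** `final.to_csv('ambitionbox_companies1.csv', index=False)`  
**Purpose:** Exports DataFrame to CSV file for future use  
**Output:** CSV file with roughly 10,000 company records (500 pages, about 20 companies per page)  
**Use Cases:** Data analysis, ML models, business intelligence


In [None]:
# index=False leaves the row index out of the file, which avoids an extra 'Unnamed: 0' column on reload
final.to_csv('ambitionbox_companies1.csv', index=False)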

---

## 🎉 **Congratulations!**

You've learned **three powerful data collection methods**:

✅ **Database Queries** - Structured data from SQL databases  
✅ **API Integration** - JSON data from RESTful APIs  
✅ **Web Scraping** - Data extraction from HTML websites

### 🚀 **Next Steps:**
- Data Cleaning & Preprocessing
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Machine Learning Model Building

---

**Happy Data Science! 📊🐍**
