# Life cycle of Data Science 
- > **1.Problem statement** 
- > **2.Data collection based on the problem statement**
- > **3.Data cleaning**
- > **4.Data analysis** 
- > **5.Data visualization**
- > **6.Feature engineering** 
- > **7.Model building** 
- > **8.Model evaluation** 
- > **9.Model deployment**

# Data collection based on the Problem statement
- > There are numerous methods for collecting data, among which web scraping is a notable technique. Let’s explore this process in a detailed, step-by-step manner to gain a comprehensive understanding.


- > **Some websites offer datasets that are downloadable in CSV format or accessible via an Application Programming Interface (API), but many websites with useful data don't offer these convenient options.**
- > **On such websites the data can only be viewed in the browser. If we wanted to analyze that data or reuse it in another application, we wouldn't want to painstakingly copy and paste everything. Web scraping is a technique that lets programming do the heavy lifting.**
- > **We can write code that visits the website, grabs just the data we want to work with, and outputs it in the format we need.**
- > **We can perform web scraping in Python using the requests and Beautiful Soup libraries, then analyze the results using pandas and visualization libraries.**

# What is web scraping
- > **Web scraping is the technique of automatically extracting data from websites. This process involves programming a script or using a tool to fetch web pages and extract information from them, mimicking the way a human would view and copy data from the site.**

- > **Web scraping is a method that simulates how a human would browse and collect data from websites. This process is automated by creating a bot, which can gather data much faster than a human. To create such a bot, two Python libraries are typically used: requests and Beautiful Soup.**

- > **Web scraping is an automatic method to collect large amounts of data from websites. Most of this data is unstructured data in an HTML format which is then converted into structured data in spreadsheets or databases so that it can be used in various applications**
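The conversion from HTML into structured data can be sketched in a few lines. This is a minimal illustration that uses only Python's standard library (`html.parser` and `csv`) as a stand-in for Beautiful Soup, which this notebook introduces later. The HTML snippet, the `listing` class name, and the `ListingParser` class are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect the text of every <li class="listing"> element."""

    def __init__(self):
        super().__init__()
        self.rows = []            # structured output: one dict per listing
        self._in_listing = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "listing") in attrs:
            self._in_listing = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_listing = False

    def handle_data(self, data):
        if self._in_listing and data.strip():
            # Hypothetical text layout: "<locality> - <monthly rent>"
            locality, rent = data.strip().split(" - ")
            self.rows.append({"locality": locality, "rent": int(rent)})


# Hypothetical HTML standing in for a downloaded page
html = """
<ul>
  <li class="listing">Gachibowli - 15000</li>
  <li class="listing">Kondapur - 12500</li>
  <li class="ad">Sponsored</li>
</ul>
"""

parser = ListingParser()
parser.feed(html)

# Write the structured rows out as CSV text
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["locality", "rent"])
writer.writeheader()
writer.writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The same idea carries over to real pages: find the elements that hold the data, turn each one into a row, and save the rows in a tabular format.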

![image.png](attachment:image.png)

In [None]:
!pip install requests

# Requests Library

- >**When you use requests to make a request, you are typically asking a web server to send you data stored on that server. This could be anything from a webpage, an image, an API response, or any other resource accessible via a URL (Uniform Resource Locator)**


- >**When you make a request using the requests library in Python, you are not copying the data on the website; rather, you are retrieving data from the web server or API that hosts the website or service**

- >**You usually make a request for information or data, and the API returns a response with what you requested. For example, every time you open Twitter or scroll down your Instagram feed, you’re basically making a request to the API behind that app and getting a response in return. This is also known as calling an API.**

![image.png](attachment:image.png)

**When you make a request to a server using the requests library in Python, the server responds with various types of status codes, depending on how it processed your request. These status codes are standardized and categorized into several classes, each giving you an idea of what happened with your request**

# 1xx: Informational Responses
- > 100 Continue: The initial part of a request has been received, and the client should continue with the rest of the request or ignore if it is already completed.
- > 101 Switching Protocols: The server is switching protocols, as requested by the client (e.g., switching to a newer HTTP version).

# 2xx: Successful Responses
- > 200 OK: The standard response for successful HTTP requests. The actual response will depend on the method used in the request.
- > 201 Created: This status code indicates that a new resource was successfully created in response to the request.
- > 204 No Content: The server successfully processed the request, but is not returning any content.

# 3xx: Redirection
- > 301 Moved Permanently: This and all future requests should be directed to the given URI.
- > 302 Found: This response code means that the resource temporarily resides at a different URI.
- > 304 Not Modified: Indicates that the resource has not been modified since the last request. Useful for caching purposes.

# 4xx: Client Errors
- > 400 Bad Request: The server could not understand the request due to invalid syntax.
- > 401 Unauthorized: The request has not been applied because it lacks valid authentication credentials for the target resource.
- > 403 Forbidden: The server understood the request but refuses to authorize it.
- > 404 Not Found: The server can't find the requested resource. In the browser, this means the URL is not recognized.
- > 429 Too Many Requests: The user has sent too many requests in a given amount of time ("rate limiting").

# 5xx: Server Errors
- > 500 Internal Server Error: A generic error message, given when no more specific message is suitable.
- > 502 Bad Gateway: The server was acting as a gateway or proxy and received an invalid response from the upstream server.
- > 503 Service Unavailable: The server is not ready to handle the request. Common causes are a server that is down for maintenance or that is overloaded.
- > 504 Gateway Timeout: The server was acting as a gateway or proxy and did not receive a timely response from the upstream server.
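The first digit of a status code determines its class, so the grouping above can be computed rather than memorized. The sketch below uses the standard library's `http.HTTPStatus` for the reason phrases; the `status_class` helper is a hypothetical name introduced for illustration.

```python
from http import HTTPStatus


def status_class(code: int) -> str:
    """Map a status code to the class names used in the headings above."""
    classes = {
        1: "Informational",
        2: "Successful",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    return classes.get(code // 100, "Unknown")


for code in (200, 301, 404, 429, 503):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Not Found"
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```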

**When working with web servers, different types of HTTP methods can be used depending on the action you want to perform. Each method tells the server what you want to do with the resources identified by the URL. Here’s a clear and short description of the most commonly used HTTP methods:**

# 1. GET
- > **Purpose:** Retrieves data from a server at the specified resource. The GET method should only retrieve data and have no other effect on the data.
- > **Example Use:** Accessing a webpage, fetching data from an API without causing changes.

# 2. POST
- > **Purpose:** Sends data to the server to create a new resource. The POST data is included in the body of the request. This may be a complex object, form data, or files.
- > **Example Use:** Submitting form data, uploading a file, or creating a new record in a database via an API.

# 3. PUT
- > **Purpose:** Sends data to the server to update/replace an existing resource at a specific URL. Like POST, the data is contained in the body of the request.
- > **Example Use:** Updating a user profile information, replacing the details of an existing record.

# 4. DELETE
- > **Purpose:** Deletes the specified resource at the URL.
- > **Example Use:** Removing a user from a database, deleting a file or an article from a server.

# 5. PATCH
- > **Purpose:** Similar to PUT, PATCH is used to apply partial modifications to a resource.
- > **Example Use:** Updating part of a resource, like changing a user's email address without affecting other data like their username.

# 6. HEAD
- > **Purpose:** Same as GET, but it only requests the headers that would be returned if the URL were requested with an HTTP GET method. This can be useful for checking what a GET request will return before actually making a GET request, or checking if a resource exists.
- > **Example Use:** Checking the content-type or modification date of a resource without downloading the entire resource.

# 7. OPTIONS
- >**Purpose:** Describes the communication options for the target resource. It can be used to check the supported HTTP methods and other capabilities without performing any action.
- > **Example Use:** Determining which operations can be performed on a specific API endpoint.
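To see how these methods differ in practice, the sketch below builds one request per method with `requests.Request(...).prepare()` and inspects it without sending anything over the network. The endpoint `https://api.example.com/users` and the payloads are hypothetical placeholders.

```python
import requests

# Hypothetical API endpoint, used only to build requests; nothing is sent
BASE = "https://api.example.com/users"

prepared = {
    "GET": requests.Request("GET", BASE, params={"page": 1}).prepare(),
    "POST": requests.Request("POST", BASE, json={"name": "Asha"}).prepare(),
    "PUT": requests.Request(
        "PUT", f"{BASE}/42", json={"name": "Asha", "email": "a@example.com"}
    ).prepare(),
    "PATCH": requests.Request(
        "PATCH", f"{BASE}/42", json={"email": "new@example.com"}
    ).prepare(),
    "DELETE": requests.Request("DELETE", f"{BASE}/42").prepare(),
}

for method, req in prepared.items():
    # GET and DELETE carry no body here; POST, PUT and PATCH send JSON
    has_body = req.body is not None
    print(f"{req.method:<6} {req.url}  body={has_body}")
```

In everyday use the shortcut functions `requests.get`, `requests.post`, `requests.put`, `requests.patch`, `requests.delete`, `requests.head`, and `requests.options` build and send the request in one step.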


# To install Requests

In [2]:
# !pip install requests

In [141]:
# We need to import the module before using it
import requests

In [38]:
# url of the web page that you are currently working on 
url = 'https://www.makaan.com/hyderabad-residential-property/rent-property-in-hyderabad-city?beds=1rk,1,2,3,3plus&propertyType=apartment,builder-floor,villa,independent-house,studio-apartment&budget=5000,16500'

In [39]:
response = requests.get(url)
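Long query strings like the one above are easier to maintain when the filters are passed as a dictionary through the `params` argument. The sketch below prepares the request without sending it and checks that the resulting URL carries the same filters. requests may percent-encode the commas, which servers normally treat the same as literal commas, though that is an assumption about this particular site.

```python
from urllib.parse import parse_qs, urlsplit

import requests

base_url = "https://www.makaan.com/hyderabad-residential-property/rent-property-in-hyderabad-city"
params = {
    "beds": "1rk,1,2,3,3plus",
    "propertyType": "apartment,builder-floor,villa,independent-house,studio-apartment",
    "budget": "5000,16500",
}

# Build the request without sending it, to inspect the final URL
prepared = requests.Request("GET", base_url, params=params).prepare()
print(prepared.url)

# Decode the query string back into a dict to confirm nothing was lost
decoded = {k: v[0] for k, v in parse_qs(urlsplit(prepared.url).query).items()}
print(decoded == params)
```

To actually send it, `requests.get(base_url, params=params)` produces the same URL.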

# 1.text : If the content type is text
- **Description:** Returns the content of the response, in Unicode.
- **Usage:** Use this when you need the text representation of the response content (e.g., HTML or JSON text).
- **Example: response.text**

# 2.content : If the content type is image/binary form 
- **Description:** Returns the content of the response, in bytes.
- **Usage:** Use this for binary data or when you need to process the data as such (e.g., for images or files).
- **Example: response.content**

# 3.json(): If the content type is JSON
- **Description:** Parses the JSON response body into Python objects (typically a dict or a list). It raises an error if the body is not valid JSON.
- **Usage:** Use this when the response is JSON to parse it easily into a Python dict.
- **Example: response.json()**


# 4.status_code
- **Description:** Returns the status code of the response.
- **Usage:** Use this to check if the request was successful (e.g., 200 is successful, 404 is not found).
- **Example: response.status_code**

# 5.headers
- **Description:** Returns a dictionary-like object allowing you to access response headers.
- **Usage:** Use this to access headers to see metadata like content type, server, date, etc.
- **Example: response.headers['Content-Type']**

**PROPERTIES**

# 6.url
- **Description:** Returns the URL of the response.
- **Usage:** Useful for logging or handling redirects (to see where a request ended up).
- **Example: response.url**

# 7.encoding
- **Description:** Returns the encoding used to decode .text.
- **Usage:** Change it to alter how the text is decoded.
- **Example: response.encoding = 'utf-8'**

# 8.cookies
- **Description:** Returns a RequestsCookieJar containing the cookies the server sent back with the response.
- **Usage:** Use this to access cookies associated with the response which can be used for subsequent requests.
- **Example: response.cookies**

# 9.elapsed
- **Description:** Returns a timedelta object indicating the time elapsed from sending the request to the arrival of the response.
- **Usage:** Use this to measure request/response latency.
- **Example: response.elapsed**

# 10.raise_for_status()
- **Description:** Raises an HTTPError, if one occurred.
- **Usage:** Use this method to throw an exception if the request came back with an unsuccessful status code.
- **Example: response.raise_for_status()**

# 11.history
- **Description:** Returns a list of Response objects from the history of the request. Each redirect response ends up here, ordered from the oldest to the most recent.
- **Usage:** This is useful for debugging and understanding the redirect chain.
- **Example: response.history**
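The attributes listed above can be gathered into a single summary, which is handy when debugging a scraper. The helper below is a minimal sketch; `describe_response` is a hypothetical name, and `response.ok` (not covered above) is a requests property that is `True` for status codes below 400.

```python
import requests


def describe_response(response: requests.Response) -> dict:
    """Collect the commonly inspected attributes of a response into one dict."""
    return {
        "status_code": response.status_code,
        "ok": response.ok,                      # True for status codes below 400
        "content_type": response.headers.get("Content-Type"),
        "encoding": response.encoding,
        "final_url": response.url,
        "redirects": len(response.history),
        "seconds": response.elapsed.total_seconds(),
        "size_bytes": len(response.content),
    }


# Usage (performs a real network request, so it is left commented out):
# response = requests.get("https://example.com", timeout=10)
# response.raise_for_status()   # raises requests.HTTPError on 4xx/5xx
# print(describe_response(response))
```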

In [18]:
response.status_code

200

In [27]:
response.reason

'OK'

In [28]:
response.headers

{'Date': 'Fri, 12 Apr 2024 06:44:21 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Server': 'nginx', 'Set-Cookie': 'websiteversion=new; Max-Age=946080000; Domain=.makaan.com; Path=/; Expires=Sun, 05 Apr 2054 06:32:27 GMT; HttpOnly, numberFormat=j%3A%7B%22code%22%3A%22%22%2C%22id%22%3A1%2C%22format%22%3A2%2C%22isConnectNow%22%3Atrue%2C%22seoLinks%22%3Atrue%7D; Path=/', 'originalUrl': '/hyderabad-residential-property/rent-property-in-hyderabad-city?beds=1rk,1,2,3,3plus&propertyType=apartment,builder-floor,villa,independent-house,studio-apartment&budget=5000,16500', 'skipAjaxyCaching': 'false', 'Cache-Control': 'public, max-age=7200', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1', 'Content-Encoding': 'gzip'}

In [29]:
response.request.headers

{'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}

In [30]:
response.content

b'<!doctype html> <html lang="en"><head><meta http-equiv="Content-type" content="text/html; charset=utf-8"><title>Property for Rent in Hyderabad | 1584+ Properties on rent in Hyderabad</title><meta name="description" content="Find 100% Verified 1584+ Properties for Rent/Lease in Hyderabad on Makaan.com. Search &#10003;1011+ Flats for Rent/Lease. &#10003;281+ Houses/Villas for Rent. Visit Now!"><meta name="keywords" content="houses for rent in Hyderabad, rental flats in Hyderabad, apartments for rent in Hyderabad, flats for rent in Hyderabad, rent house in Hyderabad, rent property in Hyderabad, makaan, makaan.com"><meta name="theme-color" content="#fff" id="themeColor"><meta content="origin" name="referrer"><meta name="p:domain_verify" content="55ce01b3ca93c05fd5a41439a23dd0d9"><meta name="fb:pages" content="155462194517712"><meta name="country" content="India"><meta name="og:type" content="website"><meta name="og:site_name" content="Makaan.com"><meta name="og:image:url" content="http:/

In [33]:
# Only the last expression in a cell is displayed, so the Content-Type value is not shown here
response.headers.get("Content-Type")
response.content

b'<!doctype html> <html lang="en"><head><meta http-equiv="Content-type" content="text/html; charset=utf-8"><title>Property for Rent in Hyderabad | 1584+ Properties on rent in Hyderabad</title><meta name="description" content="Find 100% Verified 1584+ Properties for Rent/Lease in Hyderabad on Makaan.com. Search &#10003;1011+ Flats for Rent/Lease. &#10003;281+ Houses/Villas for Rent. Visit Now!"><meta name="keywords" content="houses for rent in Hyderabad, rental flats in Hyderabad, apartments for rent in Hyderabad, flats for rent in Hyderabad, rent house in Hyderabad, rent property in Hyderabad, makaan, makaan.com"><meta name="theme-color" content="#fff" id="themeColor"><meta content="origin" name="referrer"><meta name="p:domain_verify" content="55ce01b3ca93c05fd5a41439a23dd0d9"><meta name="fb:pages" content="155462194517712"><meta name="country" content="India"><meta name="og:type" content="website"><meta name="og:site_name" content="Makaan.com"><meta name="og:image:url" content="http:/

In [19]:
response.encoding

'ISO-8859-1'
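Note the mismatch: the page's own `<meta>` tag declares `charset=utf-8`, but the `Content-Type` header shown earlier is plain `text/html` with no charset, and in that case requests falls back to ISO-8859-1. Decoding UTF-8 bytes with the wrong encoding garbles non-ASCII characters, as this small standard-library sketch shows (the sample string is made up):

```python
# UTF-8 bytes, as a server would send them
raw = "Flats for Rent ✓".encode("utf-8")

# Decoding with the wrong encoding turns the check mark into garbage
wrong = raw.decode("ISO-8859-1")
# Decoding with the right encoding restores the original text
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

With a real response, setting `response.encoding = 'utf-8'` before reading `response.text` applies the correct decoding. `response.apparent_encoding` offers a guess based on the bytes when the right encoding is not known.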

In [21]:
response.json  # no parentheses: this only references the method and does not parse anything

<bound method Response.json of <Response [200]>>

In [26]:
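response.raw.read()  # empty because the body was already consumed; pass stream=True to requests.get to read the raw stream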

b''

In [15]:
response.text

'<!doctype html> <html lang="en"><head><meta http-equiv="Content-type" content="text/html; charset=utf-8"><title>Property for Rent in Hyderabad | 1584+ Properties on rent in Hyderabad</title><meta name="description" content="Find 100% Verified 1584+ Properties for Rent/Lease in Hyderabad on Makaan.com. Search &#10003;1011+ Flats for Rent/Lease. &#10003;281+ Houses/Villas for Rent. Visit Now!"><meta name="keywords" content="houses for rent in Hyderabad, rental flats in Hyderabad, apartments for rent in Hyderabad, flats for rent in Hyderabad, rent house in Hyderabad, rent property in Hyderabad, makaan, makaan.com"><meta name="theme-color" content="#fff" id="themeColor"><meta content="origin" name="referrer"><meta name="p:domain_verify" content="55ce01b3ca93c05fd5a41439a23dd0d9"><meta name="fb:pages" content="155462194517712"><meta name="country" content="India"><meta name="og:type" content="website"><meta name="og:site_name" content="Makaan.com"><meta name="og:image:url" content="http://

In [40]:
response

<Response [200]>

In [79]:
url = 'https://www.flipkart.com/search?q=mobiles&as=on&as-show=on&otracker=AS_Query_TrendingAutoSuggest_1_0_na_na_na&otracker1=AS_Query_TrendingAutoSuggest_1_0_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobiles&requestId=7bf35a41-6c87-4372-8634-3457ba0c06ca'

In [80]:
response = requests.get(url)

In [81]:
response

<Response [500]>

# 500 Server error

- `requests.get(url,headers = {'User-Agent':'Mozilla/5.0'})`

- `headers = {"Accept-Language": "en-US,en;q=0.9"}`

- `page = requests.get(url, headers=headers)`

In [82]:
# Send browser-like headers instead of the requests defaults
request_header = {'Content-Type': 'text/html; charset=UTF-8', 'User-Agent': 'Chrome/101.0.0.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0', 'Accept-Encoding': 'gzip, deflate, br'}

response = requests.get(url, headers=request_header)

In [83]:
response

<Response [200]>
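When many pages are fetched from the same site, the headers can be set once on a `requests.Session` rather than repeated on every call. The sketch below is one way to arrange this. `make_session` and `fetch_html` are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```python
import requests

# A browser-like User-Agent; the exact string is an assumption and any
# current browser's value could be used instead
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def make_session() -> requests.Session:
    """Return a session that sends BROWSER_HEADERS with every request."""
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    return session


def fetch_html(session: requests.Session, url: str) -> str:
    """Fetch a page and return its HTML, raising on 4xx/5xx responses."""
    response = session.get(url, timeout=10)
    response.raise_for_status()
    return response.text


session = make_session()
print(session.headers["User-Agent"])
```

Calling `fetch_html(session, url)` then performs the request with these headers and raises `requests.HTTPError` instead of silently returning an error page.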

In [84]:
response.text

'<!doctype html><html lang="en"><head><link href="https://rukminim2.flixcart.com" rel="preconnect"/><link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app_modules.chunk.09b0e9.css"/><link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app.chunk.e82689.css"/><meta http-equiv="Content-type" content="text/html; charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta property="fb:page_id" content="102988293558"/><meta property="fb:admins" content="658873552,624500995,100000233612389"/><link rel="shortcut icon" href="https://static-assets-web.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico"/><link type="application/opensearchdescription+xml" rel="search" href="/osdd.xml?v=2"/><meta property="og:type" content="website"/><meta name="og_site_name" property="og:site_name" content="Flipkart.com"/><link rel="apple-touch-icon" sizes="57x57" href="/apple-touch-icon-57x57.png"/><l

# Headers 

- > **HTTP headers are a critical part of HTTP requests and responses. They carry additional information (metadata) about the request or the response between the client and the server. Headers are used for various purposes such as specifying the content type, determining the behavior of the connection, managing authentication, and much more. Understanding headers is crucial when dealing with web servers, especially when you encounter errors like 403, 500, or 503.**

### Types of HTTP Headers

- > HTTP headers can be broadly classified into several types:

- > **Request Headers:** Include more information about the resource to be fetched or about the client requesting the resource. For example, User-Agent, Accept, Cookie.

- > **Response Headers:** Provide additional information about the response, such as its location or server type. Examples include Location, Server, Set-Cookie.

- > **General Headers:** Can be used in both request and response messages but don't relate to the content of the message. Examples include Date, Cache-Control.

- > **Entity Headers:** Describe the body of the message. Examples include Content-Type, Content-Length, and Content-Encoding.

### Common Headers and Their Uses

- > **User-Agent:** Identifies the client software making the request to the server, which can be a web browser or a bot. It helps the server to deliver content optimized for the specific browser or to manage bot traffic.

- > **Accept:** Specifies the media types (HTML, JSON, XML, etc.) that the client can process. This helps the server return data in a compatible format.

- > **Authorization:** Contains credentials for authenticating the client to the server. This is often used when accessing protected resources.

- > **Content-Type:** Indicates the type of data being sent by the client (in POST requests) or the type of data being returned by the server (in responses).
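It helps to know what requests sends when no headers are supplied, because the default `User-Agent` openly identifies the client as `python-requests`. The sketch below prints those defaults and then shows how per-request headers override them. The request is only prepared, not sent, and the URL is a placeholder.

```python
import requests

# Headers that requests adds when none are supplied
defaults = requests.utils.default_headers()
for name, value in defaults.items():
    print(f"{name}: {value}")

# Per-request headers override the defaults with the same name;
# the request is only prepared here, not sent
session = requests.Session()
request = requests.Request(
    "GET",
    "https://example.com/",   # placeholder URL
    headers={"User-Agent": "Mozilla/5.0", "Accept": "text/html"},
)
prepared = session.prepare_request(request)
print(prepared.headers["User-Agent"])
print(prepared.headers["Accept"])
print(prepared.headers["Accept-Encoding"])
```

Headers that are not overridden, such as `Accept-Encoding`, keep their default values.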

### Handling Common HTTP Errors with Headers

- > **403 Forbidden:** This status code means that the server understood the request but refuses to authorize it. If you're getting a 403 error, it could be due to the server blocking access based on the User-Agent header or missing authentication tokens in the Authorization header. Modifying the User-Agent to mimic a popular browser or providing proper authentication credentials can sometimes bypass this issue.

- > **500 Internal Server Error:** Indicates that the server encountered an unexpected condition which prevented it from fulfilling the request. This is typically a problem on the server side and not much can be done with headers from the client side. However, ensuring that the Content-Type and other entity headers correctly match the data being sent can help avoid triggering errors in how the server processes the request.

- > **503 Service Unavailable:** The server is currently unable to handle the request due to temporary overloading or maintenance. While headers generally don't resolve this issue, making use of the Retry-After response header (if provided by the server) can tell you when it might be appropriate to retry the request.
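The `Retry-After` header mentioned for 503 responses (servers also commonly send it with 429) holds either a number of seconds or an HTTP date. The helper below is a minimal sketch of reading it. `retry_after_seconds` is a hypothetical name, and the 5-second default is an arbitrary assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, default=5.0, now=None):
    """Return how long to wait, based on a Retry-After header if present.

    Retry-After is either a number of seconds or an HTTP date.
    """
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable value: fall back to the default wait
        return default
    now = now or datetime.now(timezone.utc)
    # Never return a negative wait if the date is already in the past
    return max(0.0, (retry_at - now).total_seconds())


print(retry_after_seconds({"Retry-After": "120"}))
print(retry_after_seconds({}))
```

Because `response.headers` supports `.get`, it can be passed straight in: `time.sleep(retry_after_seconds(response.headers))` before retrying.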

### 403 Forbidden error 
- `requests.get(url,headers = {'User-Agent':'Mozilla/5.0'})`

In [1]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.justdial.com/',
    'Connection': 'keep-alive'
}

### 500 / 503 Server error / Service Unavailable

- `headers = {"Accept-Language": "en-US,en;q=0.9"}`

- `request_header = {'Content-Type': 'text/html; charset=UTF-8','User-Agent': 'Chrome/101.0.0.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0','Accept-Encoding': 'gzip, deflate, br'}`


# Beautiful Soup
> **Beautiful Soup is a Python library for pulling data out of HTML and XML files.**

### What is a Parser?
- > A parser is a software component that takes input data (in this case, HTML or XML) and builds a data structure – often some form of the document object model (DOM). This structure represents the content in a way that programs can easily understand and manipulate. In the context of web scraping with BeautifulSoup, a parser transforms the raw HTML code of a webpage into a "soup" object that lets you extract specific pieces of data like tags, attributes, and texts systematically.
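As a small illustration of what the parser produces, the sketch below feeds a hand-written HTML string to Beautiful Soup and reads a tag name, an attribute, and some text from the resulting tree. The HTML is a hypothetical stand-in for `response.text`, and it uses Python's built-in `"html.parser"` so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for response.text
html = """
<html>
  <head><title>Mobiles</title></head>
  <body>
    <a href="/phone-1" class="product">Phone One</a>
    <a href="/phone-2" class="product">Phone Two</a>
  </body>
</html>
"""

# "html.parser" ships with Python; "lxml" is a faster third-party alternative
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)        # text inside <title>
first_link = soup.find("a")     # first <a> tag in the document
print(first_link.name)          # tag name
print(first_link["href"])       # attribute value
print(first_link.get_text())    # text content
```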

## Installing Beautiful Soup

In [68]:
# !pip install beautifulsoup4



In [142]:
from bs4 import BeautifulSoup

In [86]:
soup = BeautifulSoup(response.text, 'lxml')  # the 'lxml' parser requires the lxml package to be installed

In [88]:
print(soup.prettify())


<!DOCTYPE html>
<html lang="en">
 <head>
  <link href="https://rukminim2.flixcart.com" rel="preconnect"/>
  <link href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app_modules.chunk.09b0e9.css" rel="stylesheet"/>
  <link href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app.chunk.e82689.css" rel="stylesheet"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="102988293558" property="fb:page_id"/>
  <meta content="658873552,624500995,100000233612389" property="fb:admins"/>
  <link href="https://static-assets-web.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico" rel="shortcut icon"/>
  <link href="/osdd.xml?v=2" rel="search" type="application/opensearchdescription+xml"/>
  <meta content="website" property="og:type"/>
  <meta content="Flipkart.com" name="og_site_name" property="og:site_name"/>
  <link href="/apple-touch-icon-57x57.png" re

In [89]:
soup.find('div',class_='_4rR01T')

<div class="_4rR01T">Nothing Phone (2a) 5G (Black, 128 GB)</div>
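`find` returns only the first match. To collect every product on a page, `find_all` can be combined with a loop, as in the sketch below. The markup is a hypothetical imitation of a results page: only the `_4rR01T` class comes from the output above, while the `card` and `price` classes and the sample products are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure of a search results page
html = """
<div class="card"><div class="_4rR01T">Phone A (Black, 128 GB)</div><div class="price">₹23,999</div></div>
<div class="card"><div class="_4rR01T">Phone B (Blue, 256 GB)</div><div class="price">₹31,499</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

products = []
for card in soup.find_all("div", class_="card"):
    name = card.find("div", class_="_4rR01T")
    price = card.find("div", class_="price")
    products.append({
        "name": name.get_text(strip=True) if name else None,
        # Drop the currency symbol and thousands separators before converting
        "price": int(price.get_text(strip=True).lstrip("₹").replace(",", "")) if price else None,
    })

print(products)
```

A list of dictionaries like this can be passed directly to `pandas.DataFrame(products)` for analysis. Class names such as `_4rR01T` look auto-generated and may change when the site is updated, so selectors like this usually need to be rechecked against the live page.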

In [154]:
url = 'https://www.flipkart.com/search?q=mobiles&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off'

In [109]:
response = requests.get(url)

In [110]:
response

<Response [500]>

In [155]:
# Send browser-like headers instead of the requests defaults
request_header = {'Content-Type': 'text/html; charset=UTF-8', 'User-Agent': 'Chrome/101.0.0.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0', 'Accept-Encoding': 'gzip, deflate, br'}

response = requests.get(url, headers=request_header)

In [145]:
response

<Response [200]>