<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 5:  Web Scraping
### Finding Underpriced RVs on Craigslist

![](https://snag.gy/WrdUMx.jpg)

In this project we will be practicing our web scraping skills.  You can use Scrapy or Python requests in order to complete this project.  It may be helpful to write some prototype code in this notebook to test your assumptions, then move it into a Python file that can be run from the command line.

> In order to run code from the command line, instead of the notebook, you just need to save your code to a file (with a .py extension), and run it using the Python interpreter:<br><br>
> `python my_file.py`

You will be building a process to scrape a single category of search results on Craigslist, that can easily be applied to other categories by changing the search terms.  The main goal is to be able to target and scrape a single page given a set of parameters.

**If you use Scrapy, provide your code in a folder.**

## Import your libraries for scrapy / requests / pandas / numpy / etc
Set up whichever libraries you need. Review past material for reference.

In [6]:
import pandas as pd
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import requests
from bs4 import BeautifulSoup

CL = requests.get("http://lasvegas.craigslist.org/search/sss?query=rv")
HTML = CL.text  
HTML[0:150] 


u'\ufeff<!DOCTYPE html>\n\n<html class="no-js"><head>\n    <title>las vegas for sale &quot;rv&quot; - craigslist</title>\n\n    <meta name="description" content="'


In [7]:
print HTML

﻿<!DOCTYPE html>

<html class="no-js"><head>
    <title>las vegas for sale &quot;rv&quot; - craigslist</title>

    <meta name="description" content="las vegas for sale &quot;rv&quot; - craigslist">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge"/>
    <link rel="canonical" href="https://lasvegas.craigslist.org/search/sss">
    <link rel="alternate" type="application/rss+xml" href="https://lasvegas.craigslist.org/search/sss?format=rss&amp;query=rv" title="RSS feed for craigslist | las vegas for sale &quot;rv&quot; - craigslist ">
    
    <link rel="next" href="https://lasvegas.craigslist.org/search/sss?s=100&amp;query=rv">
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <link type="text/css" rel="stylesheet" media="all" href="//www.craigslist.org/styles/cl.css?v=c27e72792da0a56cfce19f5d49f8838b">
    <link type="text/css" rel="stylesheet" media="all" href="//www.craigslist.org/styles/search.css?v=af75b8cc1e45fb421a5858821aadf330">
    <link type="t

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 1.  Scrape for the largest US cities (non-exhaustive list)
Search, research, and scrape Wikipedia for a list of the largest US cities.  There are a few sources, but find one that is in a nice table.  We don't want all cities, just significant cities.  Examine your source.  Look for markup that lets you tell the city entries apart from the rest of the page.

- Use requests
- Build XPath query(ies)
- Extract to a list
- Clean your list
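The "clean your list" step can be sketched in plain Python 3. This is only an illustration: it assumes the extracted strings look like the metro-area names on the Wikipedia page (for example `New York-Newark-Jersey City, NY-NJ-PA Metropolitan Statistical Area`), and the helper names `first_city` and `clean_city_list` are made up for this example.

```python
def first_city(metro_name):
    """Return the first city in a metro-area name such as
    'Dallas-Fort Worth-Arlington, TX Metropolitan Statistical Area'."""
    # Drop everything after the comma, then keep the first hyphen-separated city.
    return metro_name.split(',')[0].split('-')[0].strip()


def clean_city_list(metro_names):
    """Reduce metro-area names to unique first-city names, preserving order."""
    seen = set()
    cleaned = []
    for name in metro_names:
        if ',' not in name:
            # Entries without a ", STATE" part (e.g. navigation links) are skipped.
            continue
        city = first_city(name)
        if city not in seen:
            seen.add(city)
            cleaned.append(city)
    return cleaned


sample = [
    'Populous cities and metropolitan areas',
    'New York-Newark-Jersey City, NY-NJ-PA Metropolitan Statistical Area',
    'New York-Newark, NY-NJ-CT-PA Combined Statistical Area',
    'Los Angeles-Long Beach-Anaheim, CA Metropolitan Statistical Area',
]
print(clean_city_list(sample))
```

Splitting on the hyphen is a simplification: a city whose own name contains a hyphen would be cut short, so check the result by eye.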

In [8]:
largest_cities = requests.get("https://en.wikipedia.org/wiki/List_of_Metropolitan_Statistical_Areas")
HTML_lc = largest_cities.text  
HTML_lc[0:150]

In [80]:
print HTML_lc
Selector(text=HTML).xpath('//li[@id="kiefer-views-per-hour"]/text()').extract()

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of Metropolitan Statistical Areas - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_Metropolitan_Statistical_Areas","wgTitle":"List of Metropolitan Statistical Areas","wgCurRevisionId":745054878,"wgRevisionId":745054878,"wgArticleId":9809857,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Metropolitan statistical areas of the United States","Core based statistical areas of the United States","Lists of metropolitan areas","United States demography-related lists"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wi

[]

In [97]:
cities = Selector(text=HTML_lc).xpath('//td/a/text()').extract()
cities

[u'Populous cities and metropolitan areas',
 u'New York-Newark-Jersey City, NY-NJ-PA Metropolitan Statistical Area',
 u'New York-Newark, NY-NJ-CT-PA Combined Statistical Area',
 u'Los Angeles-Long Beach-Anaheim, CA Metropolitan Statistical Area',
 u'Los Angeles-Long Beach, CA Combined Statistical Area',
 u'Chicago-Naperville-Elgin, IL-IN-WI Metropolitan Statistical Area',
 u'Chicago-Naperville, IL-IN-WI Combined Statistical Area',
 u'Dallas-Fort Worth-Arlington, TX Metropolitan Statistical Area',
 u'Dallas-Fort Worth, TX-OK Combined Statistical Area',
 u'Houston-The Woodlands-Sugar Land, TX Metropolitan Statistical Area',
 u'Houston-The Woodlands, TX Combined Statistical Area',
 u'Washington-Arlington-Alexandria, DC-VA-MD-WV Metropolitan Statistical Area',
 u'Washington-Baltimore-Arlington, DC-MD-VA-WV-PA Combined Statistical Area',
 u'Philadelphia-Camden-Wilmington, PA-NJ-DE-MD Metropolitan Statistical Area',
 u'Philadelphia-Reading-Camden, PA-NJ-DE-MD Combined Statistical Area',
 u'M

In [98]:
cities = pd.DataFrame(cities)
cities = cities[0].head(25)
cities

0                Populous cities and metropolitan areas
1     New York-Newark-Jersey City, NY-NJ-PA Metropol...
2     New York-Newark, NY-NJ-CT-PA Combined Statisti...
3     Los Angeles-Long Beach-Anaheim, CA Metropolita...
4     Los Angeles-Long Beach, CA Combined Statistica...
5     Chicago-Naperville-Elgin, IL-IN-WI Metropolita...
6     Chicago-Naperville, IL-IN-WI Combined Statisti...
7     Dallas-Fort Worth-Arlington, TX Metropolitan S...
8     Dallas-Fort Worth, TX-OK Combined Statistical ...
9     Houston-The Woodlands-Sugar Land, TX Metropoli...
10    Houston-The Woodlands, TX Combined Statistical...
11    Washington-Arlington-Alexandria, DC-VA-MD-WV M...
12    Washington-Baltimore-Arlington, DC-MD-VA-WV-PA...
13    Philadelphia-Camden-Wilmington, PA-NJ-DE-MD Me...
14    Philadelphia-Reading-Camden, PA-NJ-DE-MD Combi...
15    Miami-Fort Lauderdale-West Palm Beach, FL Metr...
16    Miami-Fort Lauderdale-Port St. Lucie, FL Combi...
17    Atlanta-Sandy Springs-Roswell, GA Metropol

In [99]:
cities = pd.DataFrame(cities)
cities.columns = ['city']
cities

Unnamed: 0,city
0,Populous cities and metropolitan areas
1,"New York-Newark-Jersey City, NY-NJ-PA Metropol..."
2,"New York-Newark, NY-NJ-CT-PA Combined Statisti..."
3,"Los Angeles-Long Beach-Anaheim, CA Metropolita..."
4,"Los Angeles-Long Beach, CA Combined Statistica..."
5,"Chicago-Naperville-Elgin, IL-IN-WI Metropolita..."
6,"Chicago-Naperville, IL-IN-WI Combined Statisti..."
7,"Dallas-Fort Worth-Arlington, TX Metropolitan S..."
8,"Dallas-Fort Worth, TX-OK Combined Statistical ..."
9,"Houston-The Woodlands-Sugar Land, TX Metropoli..."


In [113]:
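# Keep only the first city of each metro name,
# e.g. 'New York-Newark-Jersey City, NY-NJ-PA ...' -> 'New York'.
# The result is a DataFrame with a single column named 0.
cities = cities['city'].apply(lambda x: pd.Series(x.split('-')[0]))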


In [139]:
cities = cities.drop_duplicates(subset=0)
cities = cities.drop([0])
cities = cities[0]

In [149]:
cities = pd.DataFrame(cities)
cities.loc[21] = 'Sfbay'
cities

Unnamed: 0,0
1,New York
3,Los Angeles
5,Chicago
7,Dallas
9,Houston
11,Washington
13,Philadelphia
15,Miami
17,Atlanta
19,Boston


<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 1.2 Only retain cities with properly formed ASCII

Optionally, filter out any cities with improper ASCII characters.  A smaller list will be easier to look at.  However, you may not need to filter these if you spend more time scraping a more concise city list.  This list should help you narrow down the list of regional Craigslist sites.
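One possible filtering function, as a minimal Python 3 sketch. The names `is_ascii` and `keep_ascii_cities` and the sample list are illustrative assumptions, not part of the assignment.

```python
def is_ascii(text):
    """True when every character in text is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)


def keep_ascii_cities(cities):
    """Keep only the city names made entirely of ASCII characters."""
    return [city for city in cities if is_ascii(city)]


# 'San Jos\u00e9' contains a non-ASCII character and is dropped.
print(keep_ascii_cities(['Boston', 'San Jos\u00e9', 'Miami']))
```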

In [13]:
# ONLY RETAIN PROPERLY FORMED CITIES WITH FILTERING FUNCTION


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 2.  Write a function to capture current pricing information via Craigslist in one city.
Choose a city from your scraped data, then go to the corresponding city section on Craigslist and search for "rv" in the auto section.  Write a method that pulls out the prices.

In [14]:
from bs4 import BeautifulSoup as soup

In [15]:
response = requests.get("http://newyork.craigslist.org/search/sss?sort=rel&query=rv")
HTML = response.text
Selector(text=HTML).xpath('//span[@class="price"]/text()').extract()

[u'$19999',
 u'$19999',
 u'$15000',
 u'$15000',
 u'$3000',
 u'$3000',
 u'$55',
 u'$55',
 u'$350',
 u'$350',
 u'$75',
 u'$75',
 u'$18000',
 u'$18000',
 u'$5800',
 u'$5800',
 u'$2000',
 u'$2000',
 u'$3000',
 u'$3000',
 u'$3000',
 u'$3000',
 u'$310',
 u'$310',
 u'$75',
 u'$75',
 u'$2507',
 u'$2507',
 u'$325',
 u'$325',
 u'$2000',
 u'$2000',
 u'$25',
 u'$25',
 u'$2000',
 u'$2000',
 u'$3020',
 u'$3020',
 u'$3002',
 u'$3002',
 u'$6800',
 u'$6800',
 u'$85',
 u'$85',
 u'$85',
 u'$85',
 u'$2000',
 u'$2000',
 u'$82500',
 u'$82500',
 u'$8900',
 u'$8900',
 u'$9000',
 u'$9000',
 u'$3025',
 u'$3025',
 u'$3010',
 u'$3010',
 u'$3020',
 u'$3020',
 u'$1700',
 u'$1700',
 u'$325',
 u'$325',
 u'$2000',
 u'$2000',
 u'$15000',
 u'$15000',
 u'$750',
 u'$750',
 u'$450',
 u'$450',
 u'$64000',
 u'$64000',
 u'$8995',
 u'$8995',
 u'$100',
 u'$100',
 u'$135000',
 u'$135000',
 u'$2300',
 u'$2300',
 u'$7',
 u'$7',
 u'$8',
 u'$8',
 u'$26000',
 u'$26000',
 u'$1',
 u'$1',
 u'$23012',
 u'$23012',
 u'$340',
 u'$340',
 u'$

In [151]:
import scrapy


class CraigslistSpider(scrapy.Spider):
    name = "craigs"
    # allow every regional subdomain (portland.craigslist.org, etc.)
    allowed_domains = ["craigslist.org"]
    start_urls = (
        'https://portland.craigslist.org/search/rva?hints=static',
    )

    def parse(self, response):
        # follow each posting link found on the search results page
        urls = response.xpath('//span[@class="pl"]/a[@class="hdrlnk"]/@href').extract()
        for url in urls:
            absolute_url = response.urljoin(url)
            request = scrapy.Request(
                absolute_url, callback=self.parse_post)
            yield request

    def parse_post(self, response):
        # pull the fields of interest out of a single posting page
        price = response.xpath(
            '//span[@class="price"]/text()').extract_first()
        name = response.xpath(
            '//span[@id="titletextonly"]/text()').extract_first()
        neighborhood = response.xpath(
            '//span[@class="postingtitletext"]/small/text()').extract()
        post = {
            'price': price,
            'name': name,
            'neighborhood': neighborhood,
            'url': response.url}
        yield post

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 2.1 Create a mapping of cities to corresponding regional Craigslist URLs

Major US cities on Craigslist typically have their own corresponding section (e.g. SFBay Area, NYC, Boston, Miami, Seattle).  Later, you will use these to query search results for the various metropolitan regions listed on Craigslist.  Between the major metropolitan Craigslist sites, the only thing that differs is the URLs that correspond to them.

The point of the "mapping":  Create a data structure that lets you iterate over both the name of each city from Wikipedia and the corresponding variable that will allow you to construct the Craigslist URL for each region.

> For San Francisco (the Bay Area metropolitan area), the URL for the RV search result is:
> http://sfbay.craigslist.org/search/sss?query=rv
>
> The convention is http://[region].craigslist.org/search/sss?query=rv
> Replacing [region] with the corresponding city name will allow you to quickly iterate through each regional Craigslist site, and scrape the prices from the search results.  Keep this in mind while you build this "mapping".
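A minimal sketch of such a mapping in Python 3, assuming a hand-built dictionary. `CITY_TO_REGION`, `URL_TEMPLATE`, and `search_url` are hypothetical names; the four subdomains are the ones that already appear in URLs in this notebook, and any other city would need its subdomain looked up on Craigslist.

```python
# Hypothetical hand-built mapping: Wikipedia city name -> Craigslist subdomain.
# Only regions that appear elsewhere in this notebook are listed.
CITY_TO_REGION = {
    'New York': 'newyork',
    'Las Vegas': 'lasvegas',
    'Portland': 'portland',
    'San Francisco': 'sfbay',
}

# Follows the convention http://[region].craigslist.org/search/sss?query=rv
URL_TEMPLATE = 'http://{region}.craigslist.org/search/sss?query={query}'


def search_url(city, query='rv'):
    """Build the search URL for one city from the mapping above."""
    return URL_TEMPLATE.format(region=CITY_TO_REGION[city], query=query)


for city in sorted(CITY_TO_REGION):
    print(city, search_url(city))
```

A dictionary keeps the Wikipedia name and the URL fragment together, so one loop can label results by city while requesting the right regional site.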


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 3. Define a function to calculate mean and median price per city.

Now that you've created a list of cities you want to scrape, adapt your solution for grabbing data in one region site, to grab data for all regional sites that you collected, then calculate the mean and median price of RV results from each city.

> Look at the URLs from a few different regions (ie: portland, phoenix, sfbay), and find what they have in common.  Determine the area in the URL string that needs to change the least, and figure out how to replace only that portion of the URL in order to iterate through each city.
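The calculation itself can be kept separate from the scraping. The sketch below (Python 3, standard library only) assumes the prices arrive as strings such as `'$19999'`, as returned by the price XPath query; `parse_price`, `price_summary`, and the sample data are made up for illustration.

```python
import statistics


def parse_price(price_text):
    """Convert a scraped price string such as '$19999' to a float."""
    return float(price_text.replace('$', '').replace(',', '').strip())


def price_summary(prices_by_city):
    """Map each city to the (mean, median) of its scraped price strings."""
    summary = {}
    for city, price_texts in prices_by_city.items():
        prices = [parse_price(p) for p in price_texts]
        summary[city] = (statistics.mean(prices), statistics.median(prices))
    return summary


# Made-up sample input standing in for the lists returned by the XPath query.
sample = {
    'sfbay': ['$100', '$200', '$900'],
    'portland': ['$1000', '$3000'],
}
print(price_summary(sample))
```

A city with no prices would raise `statistics.StatisticsError`, so a real run should skip or flag empty result lists.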

In [26]:
response = requests.get("http://sfbay.craigslist.org/search/sss?sort=rel&query=rv")
SF = response.text
SF_price = Selector(text=SF).xpath('//span[@class="price"]/text()').extract()
SF_price = pd.DataFrame(SF_price)
SF_price['price'] = SF_price[0]
SF_price['price'] = SF_price['price'].apply(lambda x: x.replace('$', ' '))
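SF_price.price.mean  # without parentheses this only displays the bound method; the values are still strings (dtype: object)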

<bound method Series.mean of 0       17999
1       17999
2       36999
3       36999
4       13999
5       13999
6       14900
7       14900
8       36999
9       36999
10       9999
11       9999
12        100
13        100
14       9999
15       9999
16      17999
17      17999
18        100
19        100
20          5
21          5
22      17999
23      17999
24      33016
25      33016
26         50
27         50
28       5000
29       5000
        ...  
112     28750
113     28750
114     16000
115     16000
116     15000
117     15000
118       145
119       145
120     10999
121     10999
122       100
123       100
124        60
125        60
126        90
127        90
128     39500
129     39500
130       700
131       700
132       200
133       200
134       230
135       230
136     15500
137     15500
138        90
139        90
140     33997
141     33997
Name: price, dtype: object>

In [32]:
SF_price = pd.DataFrame(SF_price)
SF_price['price'] = SF_price[0]
SF_price['price'] = SF_price['price'].apply(lambda x: x.replace('$', ' '))
SF_price.price = SF_price.price.astype(float)


Average selling price in SanFrancisco: 18580.1971831
Median selling price in SanFrancisco: 17999.0


In [18]:
portland_response = requests.get("http://portland.craigslist.org/search/sss?sort=rel&query=rv")
portland = portland_response.text
portland_price = Selector(text=portland).xpath('//span[@class="price"]/text()').extract()

In [19]:
portland_price = pd.DataFrame(portland_price)
portland_price['price'] = portland_price[0]
portland_price['price'] = portland_price['price'].apply(lambda x: x.replace('$', ' '))

In [31]:
portland_price.price = portland_price.price.astype(float)


Average selling price in Portland: 40933.9333333
Median selling price in Portland: 31999.0


In [33]:
print 'Average selling price in SanFrancisco:', SF_price.price.mean()
print 'Median selling price in SanFrancisco:', SF_price.price.median()
print 'Average selling price in Portland:', portland_price.price.mean()
print 'Median selling price in Portland:', portland_price.price.median()

Average selling price in SanFrancisco: 18580.1971831
Median selling price in SanFrancisco: 17999.0
Average selling price in Portland: 40933.9333333
Median selling price in Portland: 31999.0


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 4. Run your scraping process, and save your results to a CSV file.
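One way to save per-city results, sketched with the standard-library `csv` module in Python 3 (pandas users could call `DataFrame.to_csv` instead). The function name `save_summary`, the column names, and the output location are assumptions; the figures are the ones printed in section 3 of this notebook.

```python
import csv
import os
import tempfile


def save_summary(path, summary):
    """Write one row per city: city, mean_price, median_price."""
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(['city', 'mean_price', 'median_price'])
        for city, (mean_price, median_price) in sorted(summary.items()):
            writer.writerow([city, mean_price, median_price])


# Figures copied from the printed results in this notebook.
summary = {
    'sfbay': (18580.1971831, 17999.0),
    'portland': (40933.9333333, 31999.0),
}

# Written to the temp directory here; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), 'rv_prices.csv')
save_summary(out_path, summary)
print('saved', out_path)
```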

In [21]:
response = requests.get("http://www.portland.craigslist.org")
HTML = response.text  
HTML[0:150] 

u'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>404 Not Found</title>\n</head><body>\n<h1>Not Found</h1>\n<p>The requested URL / w'

In [152]:
import scrapy
class craigslist(scrapy.Spider):
    name = 'portland_rv'
    allowed_domains = ['portland.craigslist.org']
    start_urls = (
        'https://portland.craigslist.org/search/rva?hints=static',
    )

    def parse(self, response):
        urls = response.xpath('//span/a/@href').extract()
        for url in urls:
            absolute_url = response.urljoin(url)
            request = scrapy.Request(
                absolute_url, callback=self.parse_craigslist)
            yield request

        # follow the "next" link only when the page has one
        next_page_url = response.xpath('//div/span/a[@class="button next"]/@href').extract_first()
        if next_page_url:
            absolute_next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(absolute_next_page_url)

    def parse_craigslist(self, response):
        title = response.xpath(
            '//span[@class="pl"]/a[@class="hdrlnk"]/text()').extract_first()
        price = response.xpath(
            '//span[@class="l2"]/span[@class="price"]/text()').extract_first()
        neighborhood = response.xpath(
            '//span[@class="l2"]/span[@class="pnr"]/small/text()').extract_first()
        # yield the scraped fields so Scrapy can export them (e.g. to CSV)
        yield {
            'title': title,
            'price': price,
            'neighborhood': neighborhood,
            'url': response.url}


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 5. Do an analysis of the RV market.

Go ahead, we'll wait.  Anything notable about the data?

In [None]:
import scrapy
from RV.items import RvItem


class CraigslistSpider(scrapy.Spider):
    name = "craigsRV"
    # allow every regional subdomain (portland.craigslist.org, etc.)
    allowed_domains = ["craigslist.org"]
    start_urls = (
        'https://portland.craigslist.org/search/rva',
    )

    def parse(self, response):
        # follow each posting link found on the search results page
        urls = response.xpath('//span[@class="pl"]/a[@class="hdrlnk"]/@href').extract()
        for url in urls:
            absolute_url = response.urljoin(url)
            request = scrapy.Request(
                absolute_url, callback=self.parse_post)
            yield request

        # follow the "next" link only when the page has one
        next_page_url = response.xpath(
            '//a[@class="button next"]/@href').extract_first()
        if next_page_url:
            abs_next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(abs_next_page_url)

    def parse_post(self, response):
        price = response.xpath(
            '//span[@class="price"]/text()').extract_first()
        name = response.xpath(
            '//span[@id="titletextonly"]/text()').extract_first()
        neighborhood = response.xpath(
            '//span[@class="postingtitletext"]/small/text()').extract()
        post = {
            'price': price,
            'name': name,
            'neighborhood': neighborhood,
            'url': response.url}
        yield post

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 5.1 Does it make sense to buy RVs in one region and sell them in another?

Assume some cost for shipping or driving from one regional market to another.

The average and median prices of RVs in Portland are about twice the selling prices in San Francisco. Assuming that shipping or driving an RV from San Francisco to Portland costs no more than $1,000, buying RVs in San Francisco, moving them to Portland, and selling them there could be a decent business with a significant profit.
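The comparison can be checked with a few lines of arithmetic. The prices are the mean and median figures printed in section 3; the $1,000 transport cost is the assumption stated above, and the variable names are illustrative.

```python
# Figures printed earlier in this notebook (mean and median prices).
sf_mean, sf_median = 18580.1971831, 17999.0
pdx_mean, pdx_median = 40933.9333333, 31999.0
transport_cost = 1000.0  # assumed cost of moving one RV

mean_ratio = pdx_mean / sf_mean
median_ratio = pdx_median / sf_median
median_gap = pdx_median - sf_median - transport_cost

print('mean ratio:   %.2f' % mean_ratio)
print('median ratio: %.2f' % median_ratio)
print('median gap after transport: %.0f' % median_gap)
```

These are asking prices from one page of search results per city, and the scraped lists include very cheap items that are unlikely to be RVs (several prices are under $100), so the gap is a rough indication rather than an expected profit.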

<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 5.2 Can you pull out the "make" from the markup and include that in your analysis?
How reliable is this data and does it make sense?
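One rough approach is to search each posting title for a known make. The sketch below is only an illustration: `KNOWN_MAKES` is a hypothetical, incomplete list, `extract_make` is a made-up helper, and the sample titles are invented rather than scraped.

```python
import re

# Hypothetical list of makes to look for; extend it from the titles you actually scrape.
KNOWN_MAKES = ['Winnebago', 'Airstream', 'Fleetwood', 'Jayco', 'Coachmen']
MAKE_PATTERN = re.compile(r'\b(' + '|'.join(KNOWN_MAKES) + r')\b', re.IGNORECASE)


def extract_make(title):
    """Return the first known make mentioned in a posting title, or None."""
    match = MAKE_PATTERN.search(title)
    if match is None:
        return None
    # Normalise capitalisation to the spelling used in KNOWN_MAKES.
    lowered = match.group(1).lower()
    for make in KNOWN_MAKES:
        if make.lower() == lowered:
            return make
    return None


# Invented example titles.
for title in ['2004 winnebago adventurer 35ft', 'rv cover, like new']:
    print(title, '->', extract_make(title))
```

Titles are free text written by sellers, so misspellings and missing makes are to be expected; the share of postings where no make is found is itself a useful measure of how reliable this field is.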

<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 5.3 Are there any other variables you could pull out of the markup to help describe your dataset?

<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 6. Move your project into scrapy (if you haven't used Scrapy yet)

> Start a project by using the command `scrapy startproject [project_name]`
> - Update your settings.py (review our past example)
> - Update your items.py
> - Create a spiders file in your `[project_name]/[project_name]/spiders` directory

You can update your spider class with the complete list of craigslist "start urls" to effectively scrape all of the regions.  Start with one to test.

Updating your parse method with the method you chose should require minimal changes.  It will require you to update your parse method to use the response parameter, and an item model (defined in items.py).

<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 7.  Choose another area of Craigslist to scrape.

**Choose an area having more than a single page of results, then scrape multiple regions, multiple pages of search results, and/or details pages.**

This is the true exercise of being able to understand how to successfully plan, develop, and employ a broader scraping strategy.  Even though this seems like a challenging task, a few tweaks of your current code can make this very manageable if you've pieced together all the touch points.  If you are still confused as to some of the milestones within this process, this is an excellent opportunity to round out your understanding, or help you build a list of questions to fill in your gaps.
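For multiple pages, one option is to generate the page URLs up front. This Python 3 sketch assumes results are paged with an `s` offset in steps of 100, which is what the `rel="next"` link in the Las Vegas HTML printed earlier suggests (`...search/sss?s=100&query=rv`). The function name `page_urls` and its parameters are hypothetical, and the page size should be verified against the live site.

```python
from urllib.parse import urlencode


def page_urls(region, query, pages, page_size=100):
    """Build search URLs for the first `pages` result pages of one region.

    Assumes an `s` offset parameter that grows by `page_size` per page.
    """
    base = 'http://{}.craigslist.org/search/sss'.format(region)
    urls = []
    for page in range(pages):
        params = [('query', query)]
        if page > 0:
            # The first page has no offset; later pages start at s=100, 200, ...
            params.insert(0, ('s', page * page_size))
        urls.append(base + '?' + urlencode(params))
    return urls


for url in page_urls('lasvegas', 'rv', 3):
    print(url)
```

In a Scrapy spider, following the "next" link from each response achieves the same thing without hard-coding the page size.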

_Use Scrapy!  Provide your code in this project directory when you submit this project._

## Optional: Interview Questions

---- 

### SQL Practice

1)
We have a deliveries table with 3000 rows

`SELECT * FROM deliveries; 
-- 3000 rows in set (0.05 sec) `

15 of those orders are from a customer with the customer_id_number=32

`SELECT * FROM deliveries WHERE customer_id_number = 32;
-- 15 rows in set (0.10 sec)`

Yet, when we SELECT the number of orders that are not from customer_id_number = 32, we only get 2960 results:

`SELECT * FROM deliveries WHERE customer_id_number <> 32;
-- 2960 rows in set (0.11 sec)`

**Question: What’s wrong? And why might this be the case? Modify your code to fix this.**

2) Construct the following tables:

`mysql> SELECT * FROM Employee;
+--------+----------+--------+
| emp_id | emp_name | salary |
+--------+----------+--------+
| 1      | James    |   2000 |
| 2      | Jack     |   4000 |
| 3      | Henry    |   6000 |
| 4      | Tom      |   8000 |
+--------+----------+--------+
4 rows IN SET (0.00 sec)`


`mysql> SELECT * FROM Department;
+---------+------------+
| dept_id | dept_name  |
+---------+------------+
| 101     | Sales      |
| 102     | Marketing  |
| 103     | Finance    |
| 104     | Accounting |
+---------+------------+
4 rows IN SET (0.00 sec)`


`mysql> SELECT * FROM Register;
+--------+---------+
| emp_id | dept_id |
+--------+---------+
|      1 |     101 |
|      2 |     102 |
|      3 |     103 |
|      4 |     102 |
+--------+---------+
4 rows IN SET (0.00 sec)`

**Questions:**
- Which employees belong to which department? Show this using one line of code (hint: more than one join) 
- What is the total marketing salary? 
- Using a join, can you show that there are no employees in accounting? 



3) Given an Employee table which has 3 fields – Id (Primary key), Salary and Manager Id, where manager id is the id of the employee that manages the current employee, find all employees that make more than their manager in terms of salary. Create the table and write the code that finds this.


--- 
### Predictive Modeling

- What are some differences you would expect in a regression model that minimizes squared error, versus a model that minimizes absolute error? In which cases would each error  metric be appropriate?

- What error metric would you use to evaluate how good a binary classifier is? What if the classes are imbalanced?  What if there are more than 2 groups?

- What are various ways to predict a binary response variable? Can you compare two of them and tell me when one would be more appropriate? What’s the difference between logistic regression and SVMs?

- What is the difference between the loss functions used by SVMs and Logistic Regression? 

- What is R-squared? What are some other metrics that could be better than R-squared and why?

- You run your regression on different subsets of your data, and find that in each subset, the beta value for a certain variable varies wildly. What could be the issue here?


--- 
### Coding Questions 

- Given a sorted array and a number x, find a pair in the array whose sum is closest to x. What is the time complexity of your algorithm?
    
    `Examples:`
        Input: arr[] = {10, 22, 28, 29, 30, 40}, x = 54
        Output: 22 and 30

        Input: arr[] = {1, 3, 4, 7, 10}, x = 15
        Output: 4 and 10
        
- Check out this video on Linear Time Algorithm for finding the median: https://www.youtube.com/watch?v=_xntajCBLoE. Implement your version of this algorithm in Python. 

- Search in an almost sorted array: Given an array which is sorted, but after sorting some elements are moved to either of the adjacent positions, i.e., arr[i] may be present at arr[i+1] or arr[i-1]. Write an efficient function to search for an element in this array. Basically, the element arr[i] can only be swapped with either arr[i+1] or arr[i-1]. For example, consider the array {2, 3, 10, 4, 40}: 4 is moved to the next position and 10 is moved to the previous position. [Hint: You can do this in O(log n) time]

    `Examples: `
        Input: arr[] =  {10, 3, 40, 20, 50, 80, 70}, key = 40
        Output: 2 
        Output is index of 40 in given array

        Input: arr[] =  {10, 3, 40, 20, 50, 80, 70}, key = 90
        Output: -1
        -1 is returned to indicate element is not present