## Web Scraping Exercises

### Problem 2

From the following page of the Moneycontrol website, extract all the recommendations given by the recommenders. Arrange the scraped data as Date, Recommender's Name, Stock Name, Target, Action, etc.
https://www.moneycontrol.com/broker-research/markets/cash-200.html

Extend the problem to extract data from multiple pages, e.g. pages 200 to 204.
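
Only the page number changes in the URL above, so the addresses for a range of pages can be generated from a template. The following is a minimal sketch: `BASE_URL` and `page_urls` are illustrative names that do not appear in the exercise, and the sketch assumes every page in the range follows the same `cash-<number>.html` pattern.

```python
# Template taken from the URL in the problem statement; {page} is the part that varies
BASE_URL = "https://www.moneycontrol.com/broker-research/markets/cash-{page}.html"


def page_urls(first, last):
    """Return the URLs for pages first..last (both inclusive)."""
    return [BASE_URL.format(page=p) for p in range(first, last + 1)]


urls = page_urls(200, 204)
for url in urls:
    print(url)
```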

### Libraries

In [1]:
#Required libraries for web scraping
from urllib.request import urlopen  #to fetch the webpage
from bs4 import BeautifulSoup       #to parse the HTML content

### Extracting Data From a Single Page

If you look at the webpage, you will find that each recommendation is a small horizontal segment. If you look at the HTML source (using Inspect Element on the webpage), you will find that each segment is wrapped in a 'div' tag. Note carefully that we are interested only in the 'div' tags with the class 'Ohidden'. Inside each of them there are two lines tagged with 'p', with the classes 'op_gl12' and 'MT5'. These two lines contain the required information.
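
To see this lookup in isolation, the sketch below runs the same kind of search on a small hand-written HTML fragment instead of the live page. `SAMPLE_HTML` is a simplified assumption about the markup (the real page is likely to have more nesting and attributes). Only the tag names and the class names 'Ohidden', 'op_gl12' and 'MT5' are taken from this exercise.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page (an assumption, not the actual markup)
SAMPLE_HTML = """
<div class="Ohidden">
  <p class="op_gl12">cash |  Dec 28, 2016 | 03:12 pm</p>
  <p class="MT5">Buy Karnataka Bank; target of Rs 136: Arihant Capital </p>
</div>
<div class="other">
  <p class="MT5">Not a recommendation segment</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep only the div tags whose class is 'Ohidden'
blocks = soup.find_all("div", class_="Ohidden")

# Inside a segment, the two 'p' tags hold the date line and the recommendation line
first = blocks[0]
meta_line = first.find("p", class_="op_gl12").get_text().strip()
headline = first.find("p", class_="MT5").get_text().strip()

print(len(blocks))
print(meta_line)
print(headline)
```

Note that `find` returns `None` when a tag is missing, so calling `get_text()` on the result raises `AttributeError` for segments that do not have both 'p' tags.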

In [3]:
mc_html = urlopen('https://www.moneycontrol.com/broker-research/markets/cash-200.html')
mc_soup = BeautifulSoup(mc_html, 'html.parser')

In [11]:
div = mc_soup.find_all('div', {'class':'Ohidden'})

#Print the two lines of the first 5 segments
for i in div[:5]:
    try:
        print(i.find('p', {'class':'op_gl12'}).get_text())
        print(i.find('p', {'class':'MT5'}).get_text())
        print('\n')
    except AttributeError:
        #find() returned None: this div does not contain the expected 'p' tag, so skip it
        pass

cash |  Dec 28, 2016 | 03:12 pm
Buy Karnataka Bank; target of Rs 136: Arihant Capital 


cash |  Dec 26, 2016 | 04:57 pm
Hold Maharashtra Seamless; target of Rs 240: ICICI Direct 


cash |  Dec 26, 2016 | 04:54 pm
Buy West Coast Paper Mills; target of Rs 169: AUM Capital 


cash |  Dec 26, 2016 | 04:52 pm
Buy Tube Investments; target of Rs 719: AUM Capital 


cash |  Dec 26, 2016 | 04:51 pm
Buy Torrent Pharma; target of Rs 1665: Edelweiss 




**Note:**

1. We need to extract only the date from line 1.
2. The first word on line 2 is the action, followed by the stock name.
3. The middle portion of line 2 (between the semicolon and the colon) contains the target.
4. The last part of line 2 (after the colon) contains the recommender's name.
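
The four rules above can also be expressed as a single regular expression. The sketch below is one possible way to do it, not the approach used in the rest of this exercise. `LINE2_PATTERN` and `parse_recommendation` are illustrative names, and the pattern assumes every recommendation follows the "Action Stock; target of Rs N: Recommender" layout seen in the printed output.

```python
import re

# Line 2 looks like: "Buy Karnataka Bank; target of Rs 136: Arihant Capital"
LINE2_PATTERN = re.compile(
    r"^(?P<action>\S+)\s+"                           # rule 2: first word is the action
    r"(?P<stock>.+?);\s*"                            # rule 2: stock name, up to the semicolon
    r"target of Rs\s*(?P<target>\d[\d,]*(?:\.\d+)?)"  # rule 3: target price
    r":\s*(?P<recommender>.+)$"                      # rule 4: recommender after the colon
)


def parse_recommendation(line1, line2):
    """Return a dict of fields, or None if the lines do not match the layout."""
    # Rule 1: line 1 looks like "cash |  Dec 28, 2016 | 03:12 pm"; the date is the middle field
    fields = [field.strip() for field in line1.split("|")]
    if len(fields) < 2:
        return None
    match = LINE2_PATTERN.match(line2.strip())
    if match is None:
        return None
    return {
        "Date": fields[1],
        "Stock": match.group("stock").strip(),
        "Action": match.group("action"),
        "Target": float(match.group("target").replace(",", "")),
        "Recommender": match.group("recommender").strip(),
    }


row = parse_recommendation(
    "cash |  Dec 28, 2016 | 03:12 pm",
    "Buy Karnataka Bank; target of Rs 136: Arihant Capital ",
)
print(row)
```

Returning `None` for a line that does not match makes it easy to skip unexpected segments without leaving a half-filled record behind.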

In [39]:
#Let us parse this information and store it in different lists

div = mc_soup.find_all('div', {'class':'Ohidden'})

date = []
action = []
stock = []
target = []
recommender = []

for i in div:
    try:
        #Getting the date: line 1 looks like 'cash |  Dec 28, 2016 | 03:12 pm'
        line1 = i.find('p', {'class':'op_gl12'}).get_text()
        d = line1.split('|')[1].strip()

        #Line 2 looks like 'Buy Karnataka Bank; target of Rs 136: Arihant Capital'
        line2 = i.find('p', {'class':'MT5'}).get_text().strip()
        a = line2[:line2.find(" ")]
        s = line2[line2.find(" ")+1:line2.find(";")]
        t = float(line2[line2.find("Rs")+3: line2.find(":", line2.find("Rs"))])
        r = line2[line2.find(':')+1:].strip()
    except (AttributeError, IndexError, ValueError):
        #Skip segments that do not follow the expected layout
        continue

    #Append only after every field has been parsed, so all lists keep the same length
    date.append(d)
    action.append(a)
    stock.append(s)
    target.append(t)
    recommender.append(r)


In [40]:
#Creating a Data Frame
import pandas as pd

dic = {'Date':date, 'Stock':stock, 'Action':action, 'Target':target, 'Recommender':recommender}
moneyControl = pd.DataFrame(dic)

In [41]:
moneyControl = moneyControl[['Date','Stock','Action','Target','Recommender']]
moneyControl

|   | Date | Stock | Action | Target | Recommender |
|---|---|---|---|---|---|
| 0 | Dec 28, 2016 | Karnataka Bank | Buy | 136.0 | Arihant Capital |
| 1 | Dec 26, 2016 | Maharashtra Seamless | Hold | 240.0 | ICICI Direct |
| 2 | Dec 26, 2016 | West Coast Paper Mills | Buy | 169.0 | AUM Capital |
| 3 | Dec 26, 2016 | Tube Investments | Buy | 719.0 | AUM Capital |
| 4 | Dec 26, 2016 | Torrent Pharma | Buy | 1665.0 | Edelweiss |
| 5 | Dec 26, 2016 | TVS Srichakra | Hold | 3175.0 | Centrum |
| 6 | Dec 21, 2016 | Kaveri Seed | Buy | 489.0 | Motilal Oswal |
| 7 | Dec 21, 2016 | Aegis Logistics | Accumulate | 167.0 | CD Equisearch |
| 8 | Dec 21, 2016 | Persistent Systems | Hold | 690.0 | ICICI Direct |
| 9 | Dec 21, 2016 | Jammu & Kashmir Bank | Buy | 75.0 | ICICI Direct |


### Extracting Data From a Single Page (Function)

In [44]:
#This was from page no. 200
#Create a function to scrape the data from any page number. (The input should be a page number.)

#Required Libraries
import pandas as pd
from urllib.request import urlopen  #to fetch the webpage
from bs4 import BeautifulSoup       #to parse the HTML content

#Functions
def moneyControlBroker(pageNo):
    mc_html = urlopen('https://www.moneycontrol.com/broker-research/markets/cash-'+str(pageNo)+'.html')
    mc_soup = BeautifulSoup(mc_html, 'html.parser')

    div = mc_soup.find_all('div', {'class':'Ohidden'})

    date = []
    action = []
    stock = []
    target = []
    recommender = []

    for i in div:
        try:
            #Getting the date: line 1 looks like 'cash |  Dec 28, 2016 | 03:12 pm'
            line1 = i.find('p', {'class':'op_gl12'}).get_text()
            d = line1.split('|')[1].strip()

            #Line 2 looks like 'Buy Karnataka Bank; target of Rs 136: Arihant Capital'
            line2 = i.find('p', {'class':'MT5'}).get_text().strip()
            a = line2[:line2.find(" ")]
            s = line2[line2.find(" ")+1:line2.find(";")]
            t = float(line2[line2.find("Rs")+3: line2.find(":", line2.find("Rs"))])
            r = line2[line2.find(':')+1:].strip()
        except (AttributeError, IndexError, ValueError):
            #Skip segments that do not follow the expected layout
            continue

        #Append only after every field has been parsed, so all lists keep the same length
        date.append(d)
        action.append(a)
        stock.append(s)
        target.append(t)
        recommender.append(r)

    dic = {'Date':date, 'Stock':stock, 'Action':action, 'Target':target, 'Recommender':recommender}
    moneyControl = pd.DataFrame(dic)

    #Fix the column order
    moneyControl = moneyControl[['Date','Stock','Action','Target','Recommender']]
    return moneyControl

In [62]:
moneyControlBroker(201)

|   | Date | Stock | Action | Target | Recommender |
|---|---|---|---|---|---|
| 0 | Dec 16, 2016 | Endurance Technologies | Buy | 715.0 | Motilal Oswal |
| 1 | Dec 16, 2016 | Bajaj Finance | Accumulate | 1030.0 | KR Choksey |
| 2 | Dec 16, 2016 | J Kumar Infra | Buy | 300.0 | Axis Direct |
| 3 | Dec 15, 2016 | PNC Infratech | Buy | 128.0 | ICICI Direct |
| 4 | Dec 15, 2016 | Pennar Engineered | Buy | 191.0 | Centrum |
| 5 | Dec 15, 2016 | L&T | Buy | 800.0 | Motilal Oswal |
| 6 | Dec 15, 2016 | DEN Networks | Hold | 93.0 | Edelweiss |
| 7 | Dec 15, 2016 | Coal India | Buy | 310.0 | Edelweiss |
| 8 | Dec 15, 2016 | Nalco | Buy | 77.0 | Motilal Oswal |
| 9 | Dec 14, 2016 | V-Guard | Hold | 191.0 | Geojit BNP Paribas |


### Extracting Data From Multiple Pages

In [63]:
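#Extracting the data from page number 200 to 204

frames = []
for page in range(200,205):
    frames.append(moneyControlBroker(page))

#Concatenate once and renumber the rows, instead of restarting the index at 0 for every page
data = pd.concat(frames, ignore_index=True)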
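
When several pages are fetched in a row, a single failed request stops the whole loop. A pause between requests is also considerate to the server. The sketch below shows one way to handle both. `scrape_pages` and `fake_scraper` are hypothetical helpers introduced only for illustration, and the one-second default delay is an arbitrary assumption. The demonstration uses a stand-in scraper so that it runs without network access.

```python
import time
from urllib.error import URLError


def scrape_pages(pages, scrape_page, delay_seconds=1.0):
    """Call scrape_page(page) for each page and collect the results.

    Returns (results, failed), where failed lists (page, error message)
    for every page that raised an error instead of returning data.
    """
    results = []
    failed = []
    for n, page in enumerate(pages):
        if n > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        try:
            results.append(scrape_page(page))
        except (URLError, ValueError) as exc:
            # HTTPError is a subclass of URLError, so HTTP errors land here too
            failed.append((page, str(exc)))
    return results, failed


# Stand-in scraper (hypothetical): pretends that page 202 cannot be downloaded
def fake_scraper(page):
    if page == 202:
        raise URLError("simulated network failure")
    return "data for page " + str(page)


results, failed = scrape_pages(range(200, 205), fake_scraper, delay_seconds=0)
print(len(results))
print([page for page, _ in failed])
```

With the real function, the call would be `frames, failed = scrape_pages(range(200, 205), moneyControlBroker)`, followed by `pd.concat(frames, ignore_index=True)` to build one table with a continuous index.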

In [None]:
data.shape

In [61]:
data

|   | Date | Stock | Action | Target | Recommender |
|---|---|---|---|---|---|
| 0 | Dec 28, 2016 | Karnataka Bank | Buy | 136.0 | Arihant Capital |
| 1 | Dec 26, 2016 | Maharashtra Seamless | Hold | 240.0 | ICICI Direct |
| 2 | Dec 26, 2016 | West Coast Paper Mills | Buy | 169.0 | AUM Capital |
| 3 | Dec 26, 2016 | Tube Investments | Buy | 719.0 | AUM Capital |
| 4 | Dec 26, 2016 | Torrent Pharma | Buy | 1665.0 | Edelweiss |
| 5 | Dec 26, 2016 | TVS Srichakra | Hold | 3175.0 | Centrum |
| 6 | Dec 21, 2016 | Kaveri Seed | Buy | 489.0 | Motilal Oswal |
| 7 | Dec 21, 2016 | Aegis Logistics | Accumulate | 167.0 | CD Equisearch |
| 8 | Dec 21, 2016 | Persistent Systems | Hold | 690.0 | ICICI Direct |
| 9 | Dec 21, 2016 | Jammu & Kashmir Bank | Buy | 75.0 | ICICI Direct |
