## Web Scraping to Fetch Fund Information:

**Running the Python code below to scrape data from www.morningstar.com takes a very long time 
(about 2 hours 30 minutes) for approximately 1277 funds across 10 fund families. The scraped data is therefore exported to CSV files, 
and the subsequent data analysis is performed in a separate Python file that reads those CSV files for faster execution.**
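
The export-then-reload workflow described above can be sketched as follows. This is a minimal illustration, not the notebook's actual export code: the file name `funds_family.csv`, the temporary directory, and the two sample rows are assumptions made only to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the scraped result (illustrative rows only); in the notebook
# the real frame would be produced by the scraping loop.
scraped = pd.DataFrame(
    {
        "Fund_Family": ["Vanguard", "Fidelity Investments"],
        "Fund_Ticker": ["VFINX", "FCNTX"],
    }
)

# Hypothetical output location; a temporary directory keeps the sketch self-contained.
csv_path = os.path.join(tempfile.mkdtemp(), "funds_family.csv")

# index=False skips the row index, which would otherwise come back as an
# extra "Unnamed: 0" column when the file is read without index_col.
scraped.to_csv(csv_path, index=False)

# In the analysis file: load the cached data instead of scraping again.
cached = pd.read_csv(csv_path)
print(cached)
```

Writing the CSV once and reading it in the analysis file means the slow scraping step only has to run when the data needs refreshing.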

1. Import Python Libraries

In [14]:
# Import Section
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 
import requests
import urllib.request
import numpy as np
import pandas as pd                               # pandas
import matplotlib.pyplot as plt                   # module for plotting 
import datetime as dt                             # module for manipulating dates and times
from collections import OrderedDict

# Import scipy library
import scipy as sp

import seaborn as sns

# Import matplotlib's rcParams for plot configuration
# (pyplot is already imported above as plt)
from matplotlib import rcParams
import math

# Import Beautiful Soup library
import bs4

2. List of Top 10 Fund Families by Assets Under Management (AUM):
This list includes the fund families Vanguard, American Funds, PIMCO, Fidelity Investments, Franklin Templeton Investments, BlackRock, T. Rowe Price, J.P. Morgan Funds, Oppenheimer Funds and Columbia. We fetch the fund data from the Morningstar website, one of the largest sources of information on almost all financial securities. We use web scraping to fetch the list of funds in each fund family, along with each fund's returns and the other parameters needed for the analysis.

In [8]:
# Defining a DataFrame for Fund Family which contains Fund Family and the Fund Family URL.
FFamily = pd.DataFrame(columns=['Fund_Family','MorningstarURL'])

# List of 10 largest Mutual Fund Families
FFamily.loc[0] = ['Vanguard','http://quicktake.morningstar.com/fundfamily/vanguard/0C00001YUF/fund-list.aspx']
FFamily.loc[1] = ['American Funds','http://quicktake.morningstar.com/fundfamily/american-funds/0C00001YPH/fund-list.aspx']
FFamily.loc[2] = ['PIMCO Funds','http://quicktake.morningstar.com/fundfamily/pimco/0C00004ALK/fund-list.aspx']
FFamily.loc[3] = ['T. Rowe Price','http://quicktake.morningstar.com/fundfamily/t-rowe-price/0C00001YZ8/fund-list.aspx']
FFamily.loc[4] = ['JP Morgan','http://quicktake.morningstar.com/fundfamily/jpmorgan/0C00001YRR/fund-list.aspx']
FFamily.loc[5] = ['Fidelity Investments','http://quicktake.morningstar.com/fundfamily/fidelity-investments/0C00001YR0/fund-list.aspx']
FFamily.loc[6] = ['Franklin Templeton Investments','http://quicktake.morningstar.com/fundfamily/franklin-templeton-investments/0C00004AKN/fund-list.aspx']
FFamily.loc[7] = ['BlackRock','http://quicktake.morningstar.com/fundfamily/blackrock/0C000034YC/fund-list.aspx']
FFamily.loc[8] = ['Columbia','http://quicktake.morningstar.com/fundfamily/columbia/0C00001YQG/fund-list.aspx']
FFamily.loc[9] = ['Oppenheimer Funds','http://quicktake.morningstar.com/fundfamily/oppenheimerfunds/0C00001YZF/fund-list.aspx']


In [9]:
FFamily

| | Fund_Family | MorningstarURL |
|---|---|---|
| 0 | Vanguard | http://quicktake.morningstar.com/fundfamily/va... |
| 1 | American Funds | http://quicktake.morningstar.com/fundfamily/am... |
| 2 | PIMCO Funds | http://quicktake.morningstar.com/fundfamily/pi... |
| 3 | T. Rowe Price | http://quicktake.morningstar.com/fundfamily/t-... |
| 4 | JP Morgan | http://quicktake.morningstar.com/fundfamily/jp... |
| 5 | Fidelity Investments | http://quicktake.morningstar.com/fundfamily/fi... |
| 6 | Franklin Templeton Investments | http://quicktake.morningstar.com/fundfamily/fr... |
| 7 | BlackRock | http://quicktake.morningstar.com/fundfamily/bl... |
| 8 | Columbia | http://quicktake.morningstar.com/fundfamily/co... |
| 9 | Oppenheimer Funds | http://quicktake.morningstar.com/fundfamily/op... |


3. Fund Family with Ticker:
The section below contains code to fetch the list of funds within each family. The Morningstar pages are scraped using Beautiful Soup. A ticker is a unique identifier for a particular security, similar to a CUSIP, and for these funds it is a 5-character code (basically a combination of initials from the fund's name). However, the ticker is the more meaningful of the two because it relates to the fund's family and name.

For example, Fidelity fund tickers start with F, whereas Vanguard fund tickers start with V.
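
As a small illustration of the ticker convention described above, the sketch below checks that a string looks like a 5-letter ticker and groups tickers by their first letter. The helper names `is_fund_ticker` and `group_by_initial` are hypothetical and not part of the notebook, and the sample tickers are used only as illustrative input.

```python
from collections import defaultdict


def is_fund_ticker(ticker: str) -> bool:
    """Return True if the string looks like a 5-letter, upper-case fund ticker."""
    return len(ticker) == 5 and ticker.isalpha() and ticker.isupper()


def group_by_initial(tickers):
    """Group valid tickers by their first letter, skipping anything malformed."""
    groups = defaultdict(list)
    for ticker in tickers:
        if is_fund_ticker(ticker):
            groups[ticker[0]].append(ticker)
    return dict(groups)


# Illustrative input, including two malformed entries that should be skipped.
sample = ["VFINX", "VTSMX", "FCNTX", "", "N/A"]
print(group_by_initial(sample))
```

A check like this could be applied to scraped values before they are stored, so that empty or malformed strings do not end up in the ticker column.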

In [16]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Create the Funds Family DataFrame
Funds_family = pd.DataFrame(columns=['Fund_Name', 'Fund_Family', 'Fund_Ticker'])
i = 0

for index in range(0, len(FFamily)):
    # For each fund family, fetch the Morningstar URL
    contenturl = FFamily.MorningstarURL[index]
    
    req = requests.get(contenturl)
    page = req.content

    # Using Beautiful Soup to parse the HTML content
    soup = BeautifulSoup(page, 'html.parser')

    # Extract the information from the div which contains the class "syn_section_b1"
    table = soup.find("div", {"class": "syn_section_b1"})

    # Loop over all links to fetch each fund's name and ticker; the ticker is contained in the link's href
    for row in table.findAll('a'):
        # In these URLs, the ticker starts at character offset 73
        if (row['href'][73:]) != '':
            Funds_family.loc[i] = [row.contents[0], FFamily['Fund_Family'][index], row['href'][73:]]
            i = i + 1


AttributeError: 'NoneType' object has no attribute 'findAll'
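
The `AttributeError` above means that `soup.find` returned `None`: the fetched HTML contained no `div` with the class `syn_section_b1`. Why the div is missing is not established here; a changed page layout or a failed request are possible causes, but that is an assumption. A minimal sketch of a more defensive version follows. The helper name `extract_fund_links`, the sample HTML strings, and the example href are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_fund_links(html, container_class="syn_section_b1"):
    """Return (link text, href) pairs from the fund-list container.

    Returns an empty list when the container div is missing, instead of
    raising AttributeError on None.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("div", class_=container_class)
    if table is None:
        return []
    return [
        (a.get_text(strip=True), a["href"])
        for a in table.find_all("a", href=True)
    ]


# Hypothetical HTML snippets standing in for fetched pages.
page_ok = '<div class="syn_section_b1"><a href="/fund?t=ABCDX">Example Fund</a></div>'
page_changed = "<html><body><p>Page not found</p></body></html>"

print(extract_fund_links(page_ok))
print(extract_fund_links(page_changed))
```

In the scraping loop, an empty result could be reported and skipped with `continue`, so that one unavailable fund family page does not stop the whole run. Checking `req.status_code` before parsing would separately reveal failed requests.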