## Project: Data Mining Project for a non-profit organisation `UCOUNT`

----
## The CRISP-DM Framework

The CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology provides a structured approach to planning a data mining project. It is a robust, well-proven methodology organised into six phases, each with its own tasks:

- Business understanding (BU): Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan

- Data understanding (DU): Collect Initial Data, Describe Data, Explore Data, Verify Data Quality

- Data preparation (DP): Select Data, Clean Data, Construct Data, Integrate Data

- Modeling (M): Select Modeling Technique, Generate Test Design, Build Model, Assess Model

- Evaluation (E): Evaluate Results, Review Process, Determine Next Steps

- Deployment (D): Plan Deployment, Plan Monitoring and Maintenance, Produce Final Report, Review Project

References: 
1. [What is the CRISP-DM methodology?](https://www.sv-europe.com/crisp-dm-methodology/)

2. [Introduction to CRISP DM Framework for Data Science and Machine Learning](https://www.linkedin.com/pulse/chapter-1-introduction-crisp-dm-framework-data-science-anshul-roy/)

### Your Role
All of the tasks in this course assume that you are leading a data science group at a fictional non-profit organisation called UCOUNT (yoUr CONTribution).

As a non-profit organisation, UCOUNT's main source of income is donations. For some of its smaller projects, UCOUNT also uses Kickstarter and other crowd funding platforms to raise funds.

As the data science group leader at UCOUNT, you are tasked with advising the marketing department so that it can run effective donation and Kickstarter campaigns.

## Data Sets
You have two sets of data at your disposal. 

1. The first is a collection of past crowd funding campaigns run by different people and organisations on the [Kickstarter website](https://www.kickstarter.com/). The data was obtained using web scraper robots run by [webrobots](https://webrobots.io/kickstarter-datasets/). The bots crawled all Kickstarter projects and collected data, which was then dumped in CSV format.

   You can download the Kickstarter data set here: [kickstarter-cleaned.csv](https://github.com/flavianh/kickstarter-dash/blob/master/kickstarter-cleaned.csv)
   
   You will use this data to understand the important features of successful crowd funding campaigns. For example, what are the common characteristics of successful campaigns?

2. The second is the Census Income Data Set, which comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Census+Income). The data set was donated by Ron Kohavi and Barry Becker after being published in the article _"Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid"_. You can find the article by Ron Kohavi [online](https://www.aaai.org/Papers/KDD/1996/KDD96-033.pdf).

   You can download the Census Income Data Set here: [census-income-1996.csv](s3://10ac-courses-data/csv/census-income-1996.csv)
    
   You will use this data set to predict the annual income of a target donor. Understanding an individual's income can help a non-profit organisation like UCOUNT decide how large a donation to request, or whether to reach out at all.
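
A first look at either data set often starts with a grouped rate, for example the share of successful campaigns per category. The sketch below is only an illustration: the helper name `success_rate_by`, the column names `category` and `state`, and the toy rows are all assumptions, so check the header row of the real file before reusing it.

```python
import csv
import io
from collections import defaultdict


def success_rate_by(rows, group_col, outcome_col, positive):
    """Return {group: share of rows whose outcome equals `positive`}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        group = row[group_col]
        totals[group] += 1
        if row[outcome_col] == positive:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Made-up rows for illustration only; the column names and values
# in the real Kickstarter file are assumptions and may differ.
toy_csv = io.StringIO(
    "category,state\n"
    "games,successful\n"
    "games,failed\n"
    "music,successful\n"
)
rates = success_rate_by(csv.DictReader(toy_csv), "category", "state", "successful")
print(rates)
```

The same helper could be pointed at the census data (for example, the share of high earners per education level), again assuming the relevant column names.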


## Business Understanding

----
This step focuses on understanding the business in all its aspects. It consists of the following steps:

* Identify the goal and frame the business problem.
* Gather information on resources, constraints, assumptions, risks, etc.
* Prepare the analytical goal, i.e. decide which performance metric and loss function to use.
* Prepare a workflow chart.

### Write the main objectives of this project in your own words
minimum of 100 characters

In [None]:
main_objectives = '''Add your answer text here
you can create python string using (') or (") or 3('), like the text here. The 3(') string can be used 
to write paragraphs, comments in the beginning of functions, etc.. Your answer to the above question 
should replace this text.
'''

In [None]:
assert len(main_objectives) > 100 

### What are the main business values to UCOUNT when you successfully complete the project?
minimum of 80 characters

In [None]:
bu_ucount = '''Add your answer text here
you can create python string using (') or (") or 3('), like the text here. The 3(') string can be used 
to write paragraphs, comments in the beginning of functions, etc.. Your answer to the above question 
should replace this text.
'''

In [None]:
assert len(bu_ucount) > 80 

### Outline the different data analysis steps you will follow to carry out the project
minimum of 100 characters

In [None]:
dm_outline = '''Add your answer text here
you can create python string using (') or (") or 3('), like the text here. The 3(') string can be used 
to write paragraphs, comments in the beginning of functions, etc.. Your answer to the above question 
should replace this text.
'''

In [None]:
assert len(dm_outline) > 100 

### What metrics will you use to measure the performance of your data analysis model? 
Write the equations of the metrics here

e.g. Precision = $\frac{TP}{(TP + FP)}$
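
As a minimal sketch, the precision formula above can be computed directly from confusion-matrix counts, where TP, FP and FN are the numbers of true positives, false positives and false negatives. Recall and the F-beta score are not named in this notebook; they are included here only as commonly used companions to precision, and the function names are hypothetical.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_beta(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    p, r = precision(tp, fp), recall(tp, fn)
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


# Illustrative counts only, not results from either data set.
print(precision(tp=6, fp=2))     # 6 / 8
print(recall(tp=6, fn=6))        # 6 / 12
print(f_beta(tp=6, fp=2, fn=6))  # F1 from the two values above
```

A beta below 1 weights precision more heavily, which may suit a campaign where contacting the wrong donors is costly; a beta above 1 favours recall.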

Why do you choose these metrics? minimum of 100 characters

In [None]:
why_metrics = '''Add your answer text here
you can create python string using (') or (") or 3('), like the text here. The 3(') string can be used 
to write paragraphs, comments in the beginning of functions, etc.. Your answer to the above question 
should replace this text.
'''

In [None]:
assert len(why_metrics) > 100 

### How would you know if your data analysis work is a success or not?
minimum of 100 characters

In [None]:
how_success = '''Add your answer text here
you can create python string using (') or (") or 3('), like the text here. The 3(') string can be used 
to write paragraphs, comments in the beginning of functions, etc.. Your answer to the above question 
should replace this text.
'''

In [None]:
assert len(how_success) > 100 

### What kind of challenges do you expect in your analysis? 
List at least 3 challenges 

In [None]:
challenge_text = '''Add your answer text here
you can create python string using (') or (") or 3('), like the text here. The 3(') string can be used 
to write paragraphs, comments in the beginning of functions, etc.. Your answer to the above question 
should replace this text.
'''

In [None]:
assert len(challenge_text) > 80

_____________

# Download data

------
Have you already downloaded the two data sets to your local computer? If so, you can skip the next part. If not, uncomment the final line of the next cell and run it.

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.


In [None]:
import requests
import os
import re

def get_filename_from_cd(cd):
    """
    Get filename from content-disposition
    """
    if not cd:
        return None
    fname = re.findall('filename=(.+)', cd)
    if len(fname) == 0:
        return None
    return fname[0]

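def is_fsize_larger(header, size=200, unit=1e6): 
    """
    Return False if the reported file size exceeds size*unit bytes,
    and True otherwise (including when the size is unknown).
    Note: despite the name, True means the file is small enough to download.
    """
    # size is in multiples of `unit`; the default unit of 1e6 is a megabyte (MB)
    content_length = header.get('content-length', None)
    
    # header values are strings, so convert to int before comparing
    if content_length and int(content_length) > size*unit:  # 200 MB by default
        return False
    else:
        return True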

def is_downloadable(url, size=None, unit=1e6):
    """
    Does the url contain a downloadable resource?
    To answer this question, we first fetch the headers 
    of a url before actually downloading it.
    This allows us to skip downloading files which 
    weren't meant to be downloaded    
    """    
    h = requests.head(url, allow_redirects=True)
    header = h.headers
    
    unitd = {1e3:'KB', 1e6:'MB', 1e9:'GB', 1e12:'TB', 1e15:'PB'}[unit]
    
    dload = True 
    if size: #if size is given (not None)        
        dload = is_fsize_larger(header, size=size, unit=unit)
        
    if not dload: #file size condition
        print('file size of %s is more than %s %s'% (url, size, unitd) )
        return False
    
    #print('url=%s HEADER'%url)
    #print(header)
    
    content_type = header.get('content-type', '')  # empty string if the header is missing
    #if 'text' in content_type.lower(): #text condition
    #    print('%s is a link to plain text - not data'%url)
    #    return False
    if 'html' in content_type.lower(): #html condition
        print('%s is a link to plain html - not data'%url)
        return False
              
    return True

def make_dir_ifnew(filename, isdir=False):
    if os.path.exists(filename):
        return True
    
    if isdir: #if filename is directory path
        dirname = filename
    else: #filename is path+filename.ext
        dirname = os.path.dirname(filename) or '.'  # bare filename -> current directory
        
    if not os.path.exists(dirname):
        os.makedirs(dirname)
        print('created directory: %s'%dirname)
    else:
        print('directory %s already exists.'%dirname)
        
    return False
        
def download_data(url, filename, replace=False):
        
    #make directory if it does not exist
    file_exists = make_dir_ifnew(filename)
    
    # only download if file doesn't exist or replace=True
    if replace or not file_exists: 
        dload = is_downloadable(url)              
        print('Is the url=%s downloadable: %s' % (url,dload) )

        if dload:
            #request HTTP GET
            r = requests.get(url)
            r.raise_for_status()  # stop here on HTTP errors (4xx/5xx)

            #write the data obtained to file, closing it when done
            with open(filename, 'wb') as fout:
                fout.write(r.content)

            print('url successfully downloaded to file: %s'%filename)
        else:
            print('url can not be downloaded')
    else:
        print('%s already exists - downloading not necessary'%filename)
              
    return filename
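
The `get_filename_from_cd` helper returns everything after `filename=`, which can include surrounding quotes or trailing parameters. A possible variant that returns only the bare name is sketched below; the function name `filename_from_content_disposition` and the sample header values are hypothetical and used for illustration only.

```python
import re


def filename_from_content_disposition(cd):
    """Return the bare filename from a Content-Disposition value, or None."""
    if not cd:
        return None
    # Accept both filename="data.csv" and filename=data.csv, and stop at
    # a closing quote or at the next ';' parameter separator.
    match = re.search(r'filename="?([^";]+)"?', cd)
    return match.group(1).strip() if match else None


# Sample header values are made up for illustration.
print(filename_from_content_disposition('attachment; filename="census.csv"'))
print(filename_from_content_disposition('inline'))
```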

### Download data to local computer using the helper functions above

In [None]:
ks_filename='../data/kickstarter-cleaned.csv'
census_filename='../data/census-income-1996.csv' 

#download data if not done already
f=download_data('https://raw.githubusercontent.com/flavianh/kickstarter-dash/master/kickstarter-cleaned.csv', 
              ks_filename)
f=download_data('https://raw.githubusercontent.com/ramsaran-vuppuluri/finding_donors/master/census.csv', 
              census_filename)
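
Once the download cell has run, a quick sanity check can confirm that each file exists and looks like a CSV. This is a minimal sketch using only the standard library; the helper name `summarize_csv` is hypothetical, and reading as UTF-8 with replacement of undecodable bytes is an assumption about the files' encoding.

```python
import csv
import os


def summarize_csv(path):
    """Return (column_names, n_data_rows) for a CSV file with a header row."""
    with open(path, newline='', encoding='utf-8', errors='replace') as fin:
        reader = csv.reader(fin)
        header = next(reader, [])          # first row holds the column names
        n_rows = sum(1 for _ in reader)    # count the remaining data rows
    return header, n_rows


for path in ('../data/kickstarter-cleaned.csv', '../data/census-income-1996.csv'):
    if os.path.exists(path):
        columns, n_rows = summarize_csv(path)
        print('%s: %d rows, %d columns' % (path, n_rows, len(columns)))
    else:
        print('%s not found - run the download cell first' % path)
```

A file with zero rows, or a header that looks like HTML, would suggest the download returned an error page rather than data.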