KKBox-Churn-Prediction

This project analyses 30 GB of user log data from KKBox, a music streaming service provider, and predicts the user churn rate.


For this project, we stored the data in Amazon S3 and used Amazon EMR to run the clusters.

Following are the steps:

1) Create or log in to an AWS (Amazon Web Services) account.

2) Download the required files from https://www.kaggle.com/c/kkbox-churn-prediction-challenge/data (a Kaggle CLI alternative is sketched after step 5's commands).

3) In the AWS console, select S3 and create a new bucket named "s3kkbox" with versioning enabled and permissions set to public.

4) Since the dataset is very large, upload the data to Amazon S3 using the AWS CLI.

5) To install the AWS CLI, run the following commands:

$ curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
$ unzip awscli-bundle.zip
$ sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
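
As an alternative to the browser download in step 2, the Kaggle CLI can fetch the competition data directly; a minimal sketch, assuming the kaggle package is installed and a Kaggle API token is configured:

# download all competition files (requires accepting the competition rules on kaggle.com first)
$ kaggle competitions download -c kkbox-churn-prediction-challenge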

6) Check whether the AWS CLI is installed by running the command:

$ aws

If the output is as follows, the AWS CLI is successfully installed:

usage: aws [options] <command> <subcommand> [parameters]
To see help text, you can run:

  aws help
  aws <command> help
  aws <command> <subcommand> help

aws: error: too few arguments

7) Configure AWS CLI access with the following command, replacing the placeholders with your AWS Access Key ID and AWS Secret Access Key, which can be found under Your Security Credentials in the AWS Console:

$ aws configure

AWS Access Key ID [None]: <your access key ID>
AWS Secret Access Key [None]: <your secret access key>
Default region name [us-west-2]: us-west-2
Default output format [None]: json
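
As an aside, the bucket from step 3 can also be created and versioned from the terminal once the CLI is configured; a minimal sketch, using the us-west-2 region from above (S3 bucket names are globally unique, so "s3kkbox" may already be taken):

# create the bucket in the configured region
$ aws s3 mb s3://s3kkbox --region us-west-2

# enable versioning on the bucket, as required in step 3
$ aws s3api put-bucket-versioning --bucket s3kkbox --versioning-configuration Status=Enabled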

8) To upload the large files, navigate to the directory where your files are located and run the appropriate command in the terminal.

To upload a single file at a time:

$ aws s3 cp <local-file> <s3-location>
e.g. $ aws s3 cp transactions.csv s3://s3kkbox

To sync a folder to an S3 bucket:

$ aws s3 sync . s3://<my-bucket>/<my-dir>/
e.g. $ aws s3 sync . s3://s3kkbox

With the above commands, the data is uploaded to the Amazon S3 bucket.
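
As a quick sanity check, the uploaded objects can be listed; --human-readable and --summarize are standard flags of aws s3 ls:

# list the bucket contents with sizes and an object/size total
$ aws s3 ls s3://s3kkbox --human-readable --summarize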

9) Generate an EC2 key pair by following the instructions at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html
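
Alternatively, the key pair can be generated from the CLI; a minimal sketch, where the key name kkbox-key is only an illustrative choice:

# create the key pair and save the private key locally (key name is an assumption)
$ aws ec2 create-key-pair --key-name kkbox-key --query 'KeyMaterial' --output text > kkbox-key.pem

# restrict file permissions so SSH accepts the key
$ chmod 400 kkbox-key.pem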

10) In the AWS console, select EMR. In the Clusters menu, select 'Create cluster'.

11) In the quick cluster configuration, set the following:

Cluster name: My Cluster
Logging: Enabled
S3 folder: select the s3kkbox bucket created in step 3
Launch mode: Cluster
Release: emr-5.10.0
Applications: Spark: Spark 2.2.0 on Hadoop 2.7.3 YARN with Ganglia 3.7.2 and Zeppelin 0.7.3
Instance type: m4.xlarge
Number of instances: 4
EC2 key pair: select the EC2 key pair created in step 9

Then select 'Create cluster'.

This step creates the cluster with the required master and slave nodes.
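
For reference, roughly the same cluster can be launched from the CLI; a sketch of the configuration above, assuming the default EMR service roles exist in the account and kkbox-key is the key pair from step 9:

# launch a 4-node EMR cluster with Spark, Ganglia and Zeppelin, logging to the S3 bucket
$ aws emr create-cluster \
    --name "My Cluster" \
    --release-label emr-5.10.0 \
    --applications Name=Spark Name=Ganglia Name=Zeppelin \
    --instance-type m4.xlarge \
    --instance-count 4 \
    --ec2-attributes KeyName=kkbox-key \
    --log-uri s3://s3kkbox/logs/ \
    --use-default-roles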

12) Once the cluster has started, Zeppelin is enabled under the cluster's web connections.
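
If the Zeppelin link is not directly reachable from the browser, one common approach is an SSH tunnel to the master node; a sketch, assuming the kkbox-key.pem file from step 9 and Zeppelin's default EMR port 8890:

# forward local port 8890 to Zeppelin on the master node, then open http://localhost:8890
$ ssh -i kkbox-key.pem -N -L 8890:localhost:8890 hadoop@<master-public-dns>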

13) Open Zeppelin in a new tab and, using Zeppelin's import-note option, load the submitted .json notebook files into the cluster.
