## Preamble

### Are You in the Right Place?

The following is part of a multi-part introduction to data science for those in the legal profession. The full collection of materials can be found on the Suffolk LIT Lab's How To page under the heading [Demystify Data Science](http://suffolklitlab.org/howto/#demystified). If you've followed the instructions found at [Codifying Gut Decisions](http://suffolklitlab.org/howto/demystified/4/), you should have this notebook running as opposed to viewing a preview. That means you can run the code found below, as well as read and write files in your own copy of this library.

### A Quick Test/How To
To run the code in a given cell (one of the gray boxes), make sure that it has focus (i.e., is highlighted by a bounding box), then click the "Run" button in the menu above. Alternatively, you can press `Shift+Enter`. To give a cell focus, just click on the cell. Let's give it a try. **Run the cell below.**

In [64]:
print("Yay! It worked.")

Yay! It worked.


If the text "Yay! It worked." appeared after the cell, it worked. Yay! FYI, you are welcome to join the LIT Lab's Slack Team. There you can ask and answer questions relating to this lesson under the [#howto-datasci](https://suffolklitlab.slack.com/messages/CAKMYBRL0/) channel. See the Lab's [How To](http://suffolklitlab.org/howto/) page for more. That being said, let's get to the main course.

As you come upon cells, run them. FYI, the text blocks are actual cells too. So it's perfectly reasonable to press `Shift+Enter` to move your way down the page. If you want to see how we format text, double-click on one of the text blocks and you'll see something called [markdown](https://en.wikipedia.org/wiki/Markdown). You can set a cell to "Code" or "Markdown" in a pull-down menu above. We're not going to do anything with Markdown here, but I thought you'd like to know. Anywho, to turn the Markdown back into formatted text, just run the cell.

# The Classifier Challenge.

First, let's load Pandas.

In [69]:
import pandas as pd

Now for the good stuff. The law firm of Dewey Cheatem and Howe is a medium-sized personal injury firm, and they've provided you with two data files (`people.csv` and `calls.csv`).

In [66]:
# Note: you may need to edit the code below to deal with dates and the like. 

people = pd.read_csv('people.csv') 
calls = pd.read_csv('calls.csv') 

display(people.head())
display(calls.head())


single_table = people.merge(calls) 
print("row count:", len(single_table))
single_table.head()



Unnamed: 0.1,Unnamed: 0,person_id,name,sex,date of birth
0,0,100,Penelope Lewis,female,1990-08-31
1,1,101,David Anthony,male,1971-10-14
2,2,102,Ida Shipp,female,1962-05-24
3,3,103,Joanna Moore,female,2017-03-10
4,4,104,Lisandra Ortiz,female,2020-08-05


Unnamed: 0.1,Unnamed: 0,call_id,person_ID,Referal Soure,attorney,body_part_1,body_part_2,body_part_3,body_part_4,body_part_5,surgery,injury_date,intake,take
0,0,136,100,Facebook,Patty Hewes,Head,,,,,yes,2015-11-27,2017-02-06,no
1,1,137,101,referal,Patty Hewes,hands,elbow,,,,yes,2014-06-30,2017-02-07,No
2,2,138,102,,Rachel Zane,hip,arm,foot/feet,,,NO,2016-04-22,2017-02-07,no
3,3,139,103,Google,Perry Mason,EYE,ankle,,,,no,2015-08-12,2017-02-08,yes
4,4,140,104,website,Rusty Sabich,neck,,,,,YES,2016-12-28,2017-02-09,no


row count: 100000


Unnamed: 0.1,Unnamed: 0,person_id,name,sex,date of birth,call_id,person_ID,Referal Soure,attorney,body_part_1,body_part_2,body_part_3,body_part_4,body_part_5,surgery,injury_date,intake,take
0,0,100,Penelope Lewis,female,1990-08-31,136,100,Facebook,Patty Hewes,Head,,,,,yes,2015-11-27,2017-02-06,no
1,1,101,David Anthony,male,1971-10-14,137,101,referal,Patty Hewes,hands,elbow,,,,yes,2014-06-30,2017-02-07,No
2,2,102,Ida Shipp,female,1962-05-24,138,102,,Rachel Zane,hip,arm,foot/feet,,,NO,2016-04-22,2017-02-07,no
3,3,103,Joanna Moore,female,2017-03-10,139,103,Google,Perry Mason,EYE,ankle,,,,no,2015-08-12,2017-02-08,yes
4,4,104,Lisandra Ortiz,female,2020-08-05,140,104,website,Rusty Sabich,neck,,,,,YES,2016-12-28,2017-02-09,no
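
A note on the merge in the cell above: because no key is named, pandas joins on every column name the two tables share. Judging from the previews, that is the pair of leftover `Unnamed` index columns, since `person_id` and `person_ID` differ in capitalization. The sketch below shows one way to name the keys explicitly. The tiny tables are made up to mimic the column layout, and the `_demo` names are placeholders rather than anything from the firm's files.

```python
import pandas as pd

# Tiny made-up tables that mimic the column layout above (not the firm's data).
people_demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "person_id": [100, 101, 102],
    "name": ["A", "B", "C"],
})
calls_demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "call_id": [136, 137, 138],
    "person_ID": [102, 100, 101],  # deliberately out of order
})

# Drop the leftover index columns so they cannot act as join keys.
people_demo = people_demo.drop(columns=["Unnamed: 0"])
calls_demo = calls_demo.drop(columns=["Unnamed: 0"])

# Name the keys explicitly. validate raises an error if a person_id
# appears more than once in people_demo.
merged_demo = people_demo.merge(
    calls_demo,
    left_on="person_id",
    right_on="person_ID",
    how="inner",
    validate="one_to_many",
)
print(merged_demo)
```

With the keys spelled out, each call is matched to its caller by ID rather than by row position, which matters if the two files are ever sorted differently.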


These files contain info on potential client calls, including info about the callers and whether or not the firm took the case (this last bit is in the column "take"). This challenge is called Codifying Gut Decisions because the decision to take a case is one the firm currently makes based on a gut feeling. Your job is to turn that process into an algorithm.

Your mission, should you choose to accept it: (1) clean the data; (2) do some feature engineering if you think it's needed; and (3) train a classifier that predicts if the firm will take a case. That is, predict when the `take` column would have a "yes" in it.
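
To make steps (1) and (2) a little more concrete, here is a minimal sketch of the kind of cleanup the previews above seem to call for: yes/no values with mixed capitalization, dates stored as text, and birth dates that fall after the intake date. The rows are invented for illustration, and `age_at_intake` and `dob_suspect` are hypothetical feature names, not columns in the firm's files. Treat it as one possible starting point, not the expected solution.

```python
import pandas as pd

# Made-up rows echoing the inconsistencies visible in the previews.
demo = pd.DataFrame({
    "surgery": ["yes", "NO", "YES", "no"],
    "take": ["no", "No", "yes", "no"],
    "date of birth": ["1990-08-31", "1971-10-14", "2040-07-25", "1962-05-24"],
    "intake": ["2017-02-06", "2017-02-07", "2017-02-08", "2017-02-09"],
})

# 1. Normalize yes/no text, then map it to 1/0.
for col in ["surgery", "take"]:
    demo[col] = demo[col].str.strip().str.lower().map({"yes": 1, "no": 0})

# 2. Parse date strings; values that cannot be parsed become NaT.
for col in ["date of birth", "intake"]:
    demo[col] = pd.to_datetime(demo[col], errors="coerce")

# 3. Hypothetical feature: approximate age in years at intake.
demo["age_at_intake"] = (demo["intake"] - demo["date of birth"]).dt.days / 365.25

# 4. Flag birth dates that fall after the intake date.
demo["dob_suspect"] = demo["age_at_intake"] < 0

print(demo)
```

Whether to drop, correct, or keep the flagged rows is a judgment call that depends on what you learn about the data.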

I'll give a "prize" to the person who builds the best classifier (**Lab students only**), where "best" means the highest F1 score when the classifier is run on a set of new data that your model wasn't trained on. Namely, this data:

In [67]:
# Note: you may need to edit the code below to deal with dates and the like. 
# Also, there is no `take` column since that's what you're trying to predict. ;)

people_comp = pd.read_csv('z_people.csv') 
calls_comp = pd.read_csv('z_calls.csv') 

display(people_comp.head())
display(calls_comp.head())

single_table = people_comp.merge(calls_comp) 
print("row count:", len(single_table))
single_table.head()



Unnamed: 0.1,Unnamed: 0,person_id,name,sex,date of birth
0,0,100,Jerry Filer,male,1999-06-29
1,1,101,Richard Myers,male,1965-11-26
2,2,102,Mable Bacon,female,1956-06-19
3,3,103,Debbie Harty,female,2024-02-23
4,4,104,Shirley Pena,female,2040-07-25


Unnamed: 0.1,Unnamed: 0,call_id,person_ID,Referal Soure,attorney,body_part_1,body_part_2,body_part_3,body_part_4,body_part_5,surgery,injury_date,intake
0,0,136,100,website,Atticus Finch,hand,Neck,,,,no,2016-11-11,2017-05-22
1,1,137,101,referal,Perry Mason,Neck,,,,,yes,2015-05-09,2017-05-22
2,2,138,102,website,Rachel Zane,ankle,neck,hip,,,no,2016-11-10,2017-05-22
3,3,139,103,Other,Rusty Sabich,arm,,,,,no,2015-04-09,2017-05-22
4,4,140,104,,,neck,,,,,yes,2014-08-15,2017-05-23


row count: 1000


Unnamed: 0.1,Unnamed: 0,person_id,name,sex,date of birth,call_id,person_ID,Referal Soure,attorney,body_part_1,body_part_2,body_part_3,body_part_4,body_part_5,surgery,injury_date,intake
0,0,100,Jerry Filer,male,1999-06-29,136,100,website,Atticus Finch,hand,Neck,,,,no,2016-11-11,2017-05-22
1,1,101,Richard Myers,male,1965-11-26,137,101,referal,Perry Mason,Neck,,,,,yes,2015-05-09,2017-05-22
2,2,102,Mable Bacon,female,1956-06-19,138,102,website,Rachel Zane,ankle,neck,hip,,,no,2016-11-10,2017-05-22
3,3,103,Debbie Harty,female,2024-02-23,139,103,Other,Rusty Sabich,arm,,,,,no,2015-04-09,2017-05-22
4,4,104,Shirley Pena,female,2040-07-25,140,104,,,neck,,,,,yes,2014-08-15,2017-05-23


Your model may use different features than your peers' models. So to score your model, I'll ask you to provide a simple list of the `call_id`s for which your model predicts the firm will take the case. That is, for the new data, provide a list of the cases you think the firm would take. This list should have each `call_id` separated by a comma, and you can submit it here:

https://docs.google.com/forms/d/e/1FAIpQLSemwYALjIlwluyd1JgY3tMgBK1Pz2kFyy5DnqGrgRDuhx7jPQ/viewform
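
For a sense of how a submitted list might be scored, here is a minimal sketch of F1 computed from two sets of `call_id`s. F1 is the harmonic mean of precision (the share of your predicted takes that were real takes) and recall (the share of real takes that you predicted). The IDs and the function name `f1_from_ids` are invented for illustration; the actual scoring script is not part of this lesson.

```python
def f1_from_ids(predicted_ids, actual_ids):
    """F1 score for a predicted list of call_ids against the true takes."""
    predicted, actual = set(predicted_ids), set(actual_ids)
    true_pos = len(predicted & actual)  # cases predicted AND actually taken
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical submission and hypothetical ground truth.
predicted_takes = [136, 139, 140, 145]
actual_takes = [139, 145, 150]

score = f1_from_ids(predicted_takes, actual_takes)
print(round(score, 3))
```

Here two of the four predictions are correct (precision 0.5) and two of the three real takes are found (recall of about 0.67), so listing every case or listing almost none would both score poorly.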

**Hint:** To turn a dataframe column into a Python list (displayed with its items separated by commas), you can use the following syntax:

In [68]:
calls_comp["call_id"].tolist()


[136,
 137,
 138,
 139,
 140,
 141,
 142,
 143,
 144,
 145,
 146,
 147,
 148,
 149,
 150,
 151,
 152,
 153,
 154,
 155,
 156,
 157,
 158,
 159,
 160,
 161,
 162,
 163,
 164,
 165,
 166,
 167,
 168,
 169,
 170,
 171,
 172,
 173,
 174,
 175,
 176,
 177,
 178,
 179,
 180,
 181,
 182,
 183,
 184,
 185,
 186,
 187,
 188,
 189,
 190,
 191,
 192,
 193,
 194,
 195,
 196,
 197,
 198,
 199,
 200,
 201,
 202,
 203,
 204,
 205,
 206,
 207,
 208,
 209,
 210,
 211,
 212,
 213,
 214,
 215,
 216,
 217,
 218,
 219,
 220,
 221,
 222,
 223,
 224,
 225,
 226,
 227,
 228,
 229,
 230,
 231,
 232,
 233,
 234,
 235,
 236,
 237,
 238,
 239,
 240,
 241,
 242,
 243,
 244,
 245,
 246,
 247,
 248,
 249,
 250,
 251,
 252,
 253,
 254,
 255,
 256,
 257,
 258,
 259,
 260,
 261,
 262,
 263,
 264,
 265,
 266,
 267,
 268,
 269,
 270,
 271,
 272,
 273,
 274,
 275,
 276,
 277,
 278,
 279,
 280,
 281,
 282,
 283,
 284,
 285,
 286,
 287,
 288,
 289,
 290,
 291,
 292,
 293,
 294,
 295,
 296,
 297,
 298,
 299,
 300,
 301,
 302
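
The output above is a Python list. If you would rather paste a single line of plain text into the form, one option is to keep only the rows your model flags and join their IDs into a string. This is a sketch under assumptions: the `predicted_take` column is a hypothetical name for wherever you store your model's predictions, and the four rows are made up.

```python
import pandas as pd

# Made-up predictions; "predicted_take" is a hypothetical column name.
demo = pd.DataFrame({
    "call_id": [136, 137, 138, 139],
    "predicted_take": [1, 0, 0, 1],
})

# Keep only the calls the model says the firm would take.
take_ids = demo.loc[demo["predicted_take"] == 1, "call_id"].tolist()

# Join the IDs into one comma-separated string.
submission = ",".join(str(call_id) for call_id in take_ids)
print(submission)  # 136,139
```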

Have fun, and good luck. Who knows, maybe you'll walk away with this sweet certificate of achievement.

![certificate of achievement](http://suffolklitlab.org/howto/demystified/images/certificate_w_placeholder.png) 
