<a href="https://colab.research.google.com/github/AndrewSLowe/DS-Unit-2-Applied-Modeling/blob/master/Module3/2_3_3A_applied_modeling_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Applied Modeling, Module 3

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] Continue to iterate on your project: data cleaning, exploration, feature engineering, modeling.
- [ ] Make at least 1 partial dependence plot to explain your model.
- [ ] Share at least 1 visualization on Slack.

(If you have not yet completed an initial model for your portfolio project, then do today's assignment using your Tanzania Waterpumps model.)

## Stretch Goals
- [ ] Make multiple PDPs with 1 feature in isolation.
- [ ] Make multiple PDPs with 2 features in interaction. 
- [ ] Use Plotly to make a 3D PDP.
- [ ] Make PDPs with categorical feature(s). Use Ordinal Encoder, outside of a pipeline, to encode your data first. If there is a natural ordering, then take the time to encode it that way, instead of random integers. Then use the encoded data with pdpbox. Get readable category names on your plot, instead of integer category codes.
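
To make the categorical stretch goal concrete, here is a minimal pure-Python sketch of the idea behind a partial dependence plot on an ordinally encoded feature. It does not use pdpbox or category_encoders: the `SURFACE_ORDER` mapping, the toy rows, and `toy_predict` are hypothetical stand-ins for your own encoder mapping, data, and fitted model.

```python
# Hypothetical ordered encoding; stands in for an OrdinalEncoder mapping.
SURFACE_ORDER = {'Clay': 1, 'Hard': 2, 'Grass': 3}
CODE_TO_NAME = {code: name for name, code in SURFACE_ORDER.items()}

# Toy encoded rows: (surface_code, rank_gap). Made-up values for illustration only.
rows = [(1, 10), (2, -5), (3, 40), (1, -20), (2, 15)]

def toy_predict(surface_code, rank_gap):
  """Stand-in for a fitted model: returns a made-up win probability."""
  score = 0.5 + 0.004 * rank_gap + 0.02 * (surface_code - 2)
  return min(1.0, max(0.0, score))

def partial_dependence(rows, grid):
  """For each grid value, overwrite the feature in every row and average the predictions."""
  result = {}
  for code in grid:
    preds = [toy_predict(code, rank_gap) for _, rank_gap in rows]
    result[code] = sum(preds) / len(preds)
  return result

pdp = partial_dependence(rows, grid=sorted(CODE_TO_NAME))

# Relabel integer codes with readable category names before plotting.
readable = {CODE_TO_NAME[code]: round(value, 3) for code, value in pdp.items()}
print(readable)
```

The last step is the one the stretch goal asks for: keep the integer codes for the model, and swap in the category names only when labeling the plot.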

## Links
- [Christoph Molnar: Interpretable Machine Learning — Partial Dependence Plots](https://christophm.github.io/interpretable-ml-book/pdp.html) + [animated explanation](https://twitter.com/ChristophMolnar/status/1066398522608635904)
- [Kaggle / Dan Becker: Machine Learning Explainability — Partial Dependence Plots](https://www.kaggle.com/dansbecker/partial-plots)
- [Plotly: 3D PDP example](https://plot.ly/scikit-learn/plot-partial-dependence/#partial-dependence-of-house-value-on-median-age-and-average-occupancy)

In [1]:
!pip install category_encoders


Collecting category_encoders
  Downloading https://files.pythonhosted.org/packages/a0/52/c54191ad3782de633ea3d6ee3bb2837bda0cf3bc97644bb6375cf14150a0/category_encoders-2.1.0-py2.py3-none-any.whl (100kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.1.0


In [2]:
import pandas as pd

BASE_URL = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_{year}.csv'

# NOTE: the original hand-written list left 1969 out of the concatenation
# (probably an oversight). It is still skipped here so the recorded outputs
# below keep matching; remove the filter to include 1969.
years = [year for year in range(1968, 2020) if year != 1969]

frames = []
for year in years:
  df = pd.read_csv(BASE_URL.format(year=year))
  df['year'] = year  # tag every match with its season
  frames.append(df)

tennis = pd.concat(frames)





In [6]:
import numpy as np

def coin_flip():
  p = 0.5
  r = np.random.random()
  return 'Player 1' if r > p else 'Player 2'

tennis['players'] = pd.DataFrame([coin_flip() for i in range(tennis.shape[0])])
group = tennis.groupby(['players']).agg(['count'])
print(group.head())

         best_of draw_size l_1stIn  ... winner_rank_points winner_seed   year
           count     count   count  ...              count       count  count
players                             ...                                      
Player 1   81770       611   40392  ...              45732       31500  81770
Player 2   86532       621   43023  ...              48559       33364  86532

[2 rows x 50 columns]
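
The counts above are not an even split, and the flips are not reproducible because no seed is set. One possible cause of the imbalance is that `tennis` keeps each yearly file's row index after `pd.concat`, so assigning a DataFrame as a column may align on repeated index labels rather than by position. A minimal sketch of a seeded, one-flip-per-row alternative follows; `assign_sides` is a hypothetical helper, not part of the original notebook.

```python
import random
from collections import Counter

def assign_sides(n_rows, seed=42):
  """Return one independent 'Player 1'/'Player 2' label per row, reproducibly."""
  rng = random.Random(seed)  # local generator, so the global random state is untouched
  return [rng.choice(['Player 1', 'Player 2']) for _ in range(n_rows)]

labels = assign_sides(10_000)
counts = Counter(labels)
print(counts)
```

Assigning a plain list of length `len(tennis)`, for example `tennis['players'] = assign_sides(len(tennis))`, sets values by position and so sidesteps index alignment.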


In [7]:
set_1 = tennis[tennis.players.str.contains('Player 2')]
set_2 = tennis[tennis.players.str.contains('Player 1')]
set_1.shape, set_2.shape

((86532, 51), (81770, 51))

In [0]:
set_1 = set_1.rename(columns={'loser_name' : 'Player A'})
set_1['Player B'] = set_1.winner_name
set_2 = set_2.rename(columns={'loser_name' : 'Player B'})
set_2['Player A'] = set_2['winner_name']
set_1['Winner'] = 'Player B'
set_2['Winner'] = 'Player A'


In [0]:
# In set_1, Player B is the winner and Player A is the loser.
# Note: 'loser_age' and 'winner_age' are renamed to '... Name' below,
# so the 'Player A Name' / 'Player B Name' columns actually hold ages.
set_1 = set_1.rename(columns={'loser_age' : 'Player A Name',
                     'loser_entry': 'Player A Entry',
                     'loser_hand': 'Player A Hand',
                     'loser_ht': 'Player A Height',
                      'loser_id': 'Player A ID',
                      'loser_ioc': 'Player A ioc',
                      'loser_rank': 'Player A rank',
                      'loser_rank_points': 'Player A points',
                      'loser_seed': 'Player A seed',
                      'l_1stIn' : 'Player A 1stIn',
                      'l_1stWon': 'Player A 1stWon',
                      'l_2ndWon' : 'Player A 2ndwon',
                      'l_SvGms' : 'Player A SvGms',
                      'l_ace': 'Player A ace',
                      'l_bpFaced': 'Player A bpFaced',
                      'l_bpSaved': 'Player A bpSaved',
                      'l_df': 'Player A df',
                      'l_svpt': 'Player A svpt', 
                      
                      'winner_age' : 'Player B Name',
                     'winner_entry': 'Player B Entry',
                     'winner_hand': 'Player B Hand',
                     'winner_ht': 'Player B Height',
                      'winner_id': 'Player B ID',
                      'winner_ioc': 'Player B ioc',
                      'winner_rank': 'Player B rank',
                      'winner_rank_points': 'Player B points',
                      'winner_seed': 'Player B seed',
                      'w_1stIn' : 'Player B 1stIn',
                      'w_1stWon': 'Player B 1stWon',
                      'w_2ndWon' : 'Player B 2ndwon',
                      'w_SvGms' : 'Player B SvGms',
                      'w_ace': 'Player B ace',
                      'w_bpFaced': 'Player B bpFaced',
                      'w_bpSaved': 'Player B bpSaved',
                      'w_df': 'Player B df',
                      'w_svpt': 'Player B svpt', 
                     })



In [0]:
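# In set_2, Player A is the winner and Player B is the loser
# (as in set_1, the '... Name' columns hold ages).
set_2 = set_2.rename(columns={'loser_age' : 'Player B Name',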
                     'loser_entry': 'Player B Entry',
                     'loser_hand': 'Player B Hand',
                     'loser_ht': 'Player B Height',
                      'loser_id': 'Player B ID',
                      'loser_ioc': 'Player B ioc',
                      'loser_rank': 'Player B rank',
                      'loser_rank_points': 'Player B points',
                      'loser_seed': 'Player B seed',
                      'l_1stIn' : 'Player B 1stIn',
                      'l_1stWon': 'Player B 1stWon',
                      'l_2ndWon' : 'Player B 2ndwon',
                      'l_SvGms' : 'Player B SvGms',
                      'l_ace': 'Player B ace',
                      'l_bpFaced': 'Player B bpFaced',
                      'l_bpSaved': 'Player B bpSaved',
                      'l_df': 'Player B df',
                      'l_svpt': 'Player B svpt', 
                      
                      'winner_age' : 'Player A Name',
                     'winner_entry': 'Player A Entry',
                     'winner_hand': 'Player A Hand',
                     'winner_ht': 'Player A Height',
                      'winner_id': 'Player A ID',
                      'winner_ioc': 'Player A ioc',
                      'winner_rank': 'Player A rank',
                      'winner_rank_points': 'Player A points',
                      'winner_seed': 'Player A seed',
                      'w_1stIn' : 'Player A 1stIn',
                      'w_1stWon': 'Player A 1stWon',
                      'w_2ndWon' : 'Player A 2ndwon',
                      'w_SvGms' : 'Player A SvGms',
                      'w_ace': 'Player A ace',
                      'w_bpFaced': 'Player A bpFaced',
                      'w_bpSaved': 'Player A bpSaved',
                      'w_df': 'Player A df',
                      'w_svpt': 'Player A svpt', 
                     })
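
The two rename dictionaries above are mirror images of each other, so they could be generated instead of typed out. The sketch below is one way to do that; `build_rename_map` and the two field lists are hypothetical, and the generated names reuse the raw field suffixes (for example `Player A ht`, `Player A age`) rather than the hand-picked labels used in the notebook.

```python
# Player-level columns ('loser_hand', 'winner_hand', ...) and serve-stat columns ('l_ace', 'w_ace', ...).
PLAYER_FIELDS = ['age', 'entry', 'hand', 'ht', 'id', 'ioc', 'rank', 'rank_points', 'seed']
SERVE_FIELDS = ['1stIn', '1stWon', '2ndWon', 'SvGms', 'ace', 'bpFaced', 'bpSaved', 'df', 'svpt']

def build_rename_map(winner_label, loser_label):
  """Map winner_*/loser_* and w_*/l_* columns to '<label> <field>' names."""
  mapping = {}
  for role, short, label in [('winner', 'w', winner_label), ('loser', 'l', loser_label)]:
    for field in PLAYER_FIELDS:
      mapping[f'{role}_{field}'] = f'{label} {field}'
    for field in SERVE_FIELDS:
      mapping[f'{short}_{field}'] = f'{label} {field}'
  return mapping

# set_1: winner becomes Player B; set_2: winner becomes Player A.
set_1_map = build_rename_map(winner_label='Player B', loser_label='Player A')
set_2_map = build_rename_map(winner_label='Player A', loser_label='Player B')
print(set_1_map['w_ace'], '|', set_2_map['w_ace'])
```

Each map can then be passed to `DataFrame.rename(columns=...)`, which keeps the two halves of the data consistent by construction.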

In [30]:
tennis2 = pd.concat([set_1, set_2])
print(tennis2.shape)
tennis2.head()



(168302, 53)


Unnamed: 0,Player A,Player A 1stIn,Player A 1stWon,Player A 2ndwon,Player A Entry,Player A Hand,Player A Height,Player A ID,Player A Name,Player A SvGms,Player A ace,Player A bpFaced,Player A bpSaved,Player A df,Player A ioc,Player A points,Player A rank,Player A seed,Player A svpt,Player B,Player B 1stIn,Player B 1stWon,Player B 2ndwon,Player B Entry,Player B Hand,Player B Height,Player B ID,Player B Name,Player B SvGms,Player B ace,Player B bpFaced,Player B bpSaved,Player B df,Player B ioc,Player B points,Player B rank,Player B seed,Player B svpt,Winner,best_of,draw_size,match_num,minutes,players,round,score,surface,tourney_date,tourney_id,tourney_level,tourney_name,winner_name,year
1,Ernie Mccabe,,,,,R,,106964,,,,,,,AUS,,,,,John Brown,,,,,R,,109803,27.520876,,,,,,AUS,,,,,Player B,5,64.0,2,,Player 2,R64,6-3 6-2 6-4,Grass,19680119,1968-580,G,Australian Chps.,John Brown,1968
3,Robert Layton,,,,,R,,110025,,,,,,,AUS,,,,,Allan Stone,,,,,R,,100105,22.264203,,,,,,AUS,,,5.0,,Player B,5,64.0,4,,Player 2,R64,6-4 6-2 6-1,Grass,19680119,1968-580,G,Australian Chps.,Allan Stone,1968
7,Tony Dawson,,,,,R,,108430,,,,,,,AUS,,,,,Barry Phillips Moore,,,,,R,173.0,100025,30.529774,,,,,,AUS,,,3.0,,Player B,5,64.0,8,,Player 2,R64,6-3 6-0 6-3,Grass,19680119,1968-580,G,Australian Chps.,Barry Phillips Moore,1968
8,Peter Oatey,,,,,R,,110029,,,,,,,AUS,,,,,William Coghlan,,,,,R,,108519,,,,,,,AUS,,,,,Player B,5,64.0,9,,Player 2,R64,6-0 6-2 9-11 6-3,Grass,19680119,1968-580,G,Australian Chps.,William Coghlan,1968
13,J May,,,,,R,,110035,,,,,,,AUS,,,,,Merv Guse,,,,,R,,110034,,,,,,,AUS,,,,,Player B,5,64.0,14,,Player 2,R64,6-1 6-2 6-2,Grass,19680119,1968-580,G,Australian Chps.,Merv Guse,1968


In [31]:
set_2.columns

Index(['best_of', 'draw_size', 'Player B 1stIn', 'Player B 1stWon',
       'Player B 2ndwon', 'Player B SvGms', 'Player B ace', 'Player B bpFaced',
       'Player B bpSaved', 'Player B df', 'Player B svpt', 'Player B Name',
       'Player B Entry', 'Player B Hand', 'Player B Height', 'Player B ID',
       'Player B ioc', 'Player B', 'Player B rank', 'Player B points',
       'Player B seed', 'match_num', 'minutes', 'round', 'score', 'surface',
       'tourney_date', 'tourney_id', 'tourney_level', 'tourney_name',
       'Player A 1stIn', 'Player A 1stWon', 'Player A 2ndwon',
       'Player A SvGms', 'Player A ace', 'Player A bpFaced',
       'Player A bpSaved', 'Player A df', 'Player A svpt', 'Player A Name',
       'Player A Entry', 'Player A Hand', 'Player A Height', 'Player A ID',
       'Player A ioc', 'winner_name', 'Player A rank', 'Player A points',
       'Player A seed', 'year', 'players', 'Player A', 'Winner'],
      dtype='object')

In [33]:
Player_A = input('please enter player A: ')
Player_B = input(' please enter player B: ')


please enter player A: Roger Federer
 please enter player B: Rafael Nadal


In [34]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

target = 'Winner'

train, test = train_test_split(tennis2, test_size=.15,  
                              random_state=42)
train, val = train_test_split(train, test_size=.2,  
                              random_state=42)
train.shape, val.shape, test.shape

((114444, 53), (28612, 53), (25246, 53))
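
The two chained splits produce roughly a 68/17/15 train/validation/test partition. The arithmetic can be checked with the sketch below, which assumes the held-out size is rounded up for a fractional `test_size`; that assumption reproduces the shapes printed above. `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_rows, test_size):
  """Sizes produced by a fractional test_size, with the held-out part rounded up."""
  n_held_out = math.ceil(n_rows * test_size)
  return n_rows - n_held_out, n_held_out

n_total = 168302
n_rest, n_test = split_sizes(n_total, 0.15)  # first split: hold out the test set
n_train, n_val = split_sizes(n_rest, 0.20)   # second split: carve validation out of the rest

print(n_train, n_val, n_test)
for name, n in [('train', n_train), ('val', n_val), ('test', n_test)]:
  print(f'{name}: {n / n_total:.1%}')
```

Note that the second `test_size=.2` applies to the remaining 85% of the rows, which is why validation ends up near 17% of the full dataset rather than 20%.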

In [35]:
y_train = train[target]
y_train.value_counts(normalize=True)  # Baseline: Player B wins slightly more than 50% of the time.

Player B    0.514286
Player A    0.485714
Name: Winner, dtype: float64
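
The majority-class baseline behind these proportions can be written out explicitly: always predict the most common training label, then measure accuracy on whichever labels you evaluate against. The sketch below uses made-up toy labels, not the tennis data, and `majority_baseline` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline(train_labels, eval_labels):
  """Accuracy of always predicting the most common training label."""
  majority_label, _ = Counter(train_labels).most_common(1)[0]
  hits = sum(label == majority_label for label in eval_labels)
  return majority_label, hits / len(eval_labels)

# Toy labels, made up for illustration.
toy_train = ['Player B'] * 6 + ['Player A'] * 4
toy_val = ['Player B', 'Player A', 'Player B', 'Player A', 'Player A']

label, accuracy = majority_baseline(toy_train, toy_val)
print(label, accuracy)
```

Any model worth keeping should beat this number on the validation set, not just on the training set.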

In [38]:
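# Baseline prediction: always pick the majority class from the training labels.
majority_class = y_train.value_counts().idxmax()
if majority_class == 'Player B':
  print(Player_B)
else:
  print(Player_A)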

Rafael Nadal


In [37]:
y_val = val[target]
y_val.value_counts(normalize=True)  # Roughly the same class balance as the training set.

Player B    0.515937
Player A    0.484063
Name: Winner, dtype: float64

In [19]:
train.head()

Unnamed: 0,Player A,Player A 1stIn,Player A 1stWon,Player A 2ndwon,Player A Entry,Player A Hand,Player A Height,Player A ID,Player A Name,Player A SvGms,Player A ace,Player A bpFaced,Player A bpSaved,Player A df,Player A ioc,Player A points,Player A rank,Player A seed,Player A svpt,Player B,Player B 1stIn,Player B 1stWon,Player B 2ndwon,Player B Entry,Player B Hand,Player B Height,Player B ID,Player B Name,Player B SvGms,Player B ace,Player B bpFaced,Player B bpSaved,Player B df,Player B ioc,Player B points,Player B rank,Player B seed,Player B svpt,Winner,best_of,draw_size,match_num,minutes,players,round,score,surface,tourney_date,tourney_id,tourney_level,tourney_name,winner_name,year
5,Arvind Parmar,64.0,43.0,24.0,Q,R,193.0,103157,24.77,15.0,4.0,2.0,1.0,3.0,GBR,214.0,170.0,,96.0,Harel Levy,58.0,38.0,22.0,,R,185.0,103248,24.4,14.0,3.0,13.0,9.0,0.0,ISR,114.0,252.0,,96.0,Player A,3,,6,145.0,Player 1,R32,6-7(5) 6-4 6-1,Hard,20021230,2003-339,A,Adelaide,Arvind Parmar,2003
2430,Fred Stolle,,,,,R,190.0,100030,33.03,,,,,,AUS,,,,,Martin Mulligan,,,,,R,,100050,31.0,,,,,,AUS,,,,,Player B,3,,47,,Player 2,QF,6-4 6-3,Clay,19711018,1971-425,A,Barcelona WCT,Martin Mulligan,1971
1007,Bjorn Munroe,,,,,R,,108663,25.4,,,,,,BAH,,,,,Alejandro Hernandez,,,,,R,180.0,103060,26.52,,,,,,MEX,143.0,251.0,,,Player B,3,,5,,Player 2,RR,6-2 7-6(2),Clay,20040409,2004-D058,D,Davis Cup G2 R2: MEX vs BAH,Alejandro Hernandez,2004
2753,Gustavo Kuerten,36.0,29.0,22.0,,R,190.0,102856,20.94,11.0,11.0,5.0,4.0,3.0,BRA,2228.0,9.0,4.0,73.0,Marcelo Filippini,39.0,34.0,17.0,,R,178.0,101382,30.04,11.0,7.0,7.0,5.0,3.0,URU,958.0,45.0,,79.0,Player A,3,,12,89.0,Player 1,R32,6-4 7-6(5),Hard,19970818,1997-441,A,Long Island,Gustavo Kuerten,1997
2034,Ivan Lendl,,,,,R,188.0,100656,24.48,,,,,,USA,,2.0,2.0,,John Mcenroe,,,,,L,180.0,100581,25.53,,,,,,USA,,1.0,1.0,,Player B,5,,7,,Player 2,F,6-3 6-4 6-1,Hard,19840828,1984-560,G,US Open,John Mcenroe,1984


In [20]:
X_val = val.drop(columns=['Winner', 'winner_name', 'players'])
X_train = train.drop(columns=['Winner', 'winner_name', 'players'])
X_val.head()

Unnamed: 0,Player A,Player A 1stIn,Player A 1stWon,Player A 2ndwon,Player A Entry,Player A Hand,Player A Height,Player A ID,Player A Name,Player A SvGms,Player A ace,Player A bpFaced,Player A bpSaved,Player A df,Player A ioc,Player A points,Player A rank,Player A seed,Player A svpt,Player B,Player B 1stIn,Player B 1stWon,Player B 2ndwon,Player B Entry,Player B Hand,Player B Height,Player B ID,Player B Name,Player B SvGms,Player B ace,Player B bpFaced,Player B bpSaved,Player B df,Player B ioc,Player B points,Player B rank,Player B seed,Player B svpt,best_of,draw_size,match_num,minutes,round,score,surface,tourney_date,tourney_id,tourney_level,tourney_name,year
2772,Richard Pancho Gonzales,,,,,R,188.0,100005,45.28,,,,,,USA,,,,,Vijay Amritraj,,,,,R,193.0,100321,19.68,,,,,,IND,,,7.0,,3,,29,,SF,6-2 6-3,Grass,19730820,1973-707,A,South Orange,1973
1087,Lukas Rosol,48.0,35.0,21.0,,R,196.0,104586,30.74,12.0,8.0,6.0,4.0,3.0,CZE,782.0,61.0,,84.0,Kyle Edmund,48.0,38.0,18.0,,R,,106378,21.28,12.0,11.0,3.0,1.0,3.0,GBR,608.0,92.0,,80.0,3,,280,103.0,R32,7-6(4) 7-6(5),Clay,20160418,2016-0773,A,Bucharest,2016
38,Stephane Simian,23.0,11.0,11.0,,L,193.0,101362,29.56,7.0,1.0,4.0,0.0,2.0,FRA,434.0,92.0,,44.0,Petr Korda,23.0,17.0,17.0,,L,190.0,101434,28.94,8.0,2.0,1.0,1.0,1.0,CZE,866.0,24.0,6.0,45.0,3,,8,55.0,R32,6-1 6-2,Hard,19961230,1997-451,A,Doha,1997
529,Boris Becker,36.0,29.0,9.0,,R,190.0,101414,26.21,9.0,8.0,2.0,1.0,4.0,GER,1729.0,13.0,5.0,59.0,Ronald Agenor,40.0,22.0,19.0,,R,180.0,101086,29.23,10.0,2.0,9.0,5.0,0.0,USA,677.0,55.0,,80.0,3,,29,107.0,SF,7-6(6) 6-1,Carpet,19940207,1994-408,A,Milan,1994
1822,Olivier Mutis,96.0,62.0,45.0,,R,175.0,103133,25.39,26.0,5.0,19.0,13.0,9.0,FRA,554.0,78.0,,177.0,Paradorn Srichaphan,101.0,74.0,32.0,,R,185.0,103387,24.02,27.0,15.0,14.0,6.0,6.0,THA,1760.0,11.0,12.0,178.0,5,,69,217.0,R64,4-6 1-6 7-6(4) 7-5 7-5,Grass,20030623,2003-540,G,Wimbledon,2003


In [40]:
import category_encoders as ce
import numpy as np
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = make_pipeline(
    ce.OrdinalEncoder(),                     # encode string columns as integer codes
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
)

pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['Player A', 'Player A Entry',
                                      'Player A Hand', 'Player A ioc',
                                      'Player B', 'Player B Entry',
                                      'Player B Hand', 'Player B ioc', 'round',
                                      'score', 'surface', 'tourney_id',
                                      'tourney_level', 'tourney_name'],
                                drop_invariant=False, handle_missing='value',
                                handle_unknown='value',
                                mapping=[{'col': 'Player A',
                                          'data_type': dty...
                ('randomforestclassifier',
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
  

In [41]:
pipeline.score(X_train, y_train)

1.0
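
A training accuracy of 1.0 mostly shows that the model can reproduce rows it has already seen; it says little about new matches. In this notebook the analogous check would be `pipeline.score(X_val, y_val)`. The sketch below illustrates the gap with a deliberately extreme, made-up case: a "model" that only memorizes its training rows, fed coin-flip labels. All names and data here are hypothetical.

```python
import random
from collections import Counter

rng = random.Random(0)

# Made-up data: a unique match id as the only feature, and a coin-flip label.
data = [(match_id, rng.choice(['Player A', 'Player B'])) for match_id in range(1000)]
toy_train, toy_val = data[:800], data[800:]

# A "model" that simply memorizes the training rows.
memory = dict(toy_train)
fallback = Counter(label for _, label in toy_train).most_common(1)[0][0]

def predict(match_id):
  return memory.get(match_id, fallback)  # unseen rows get the majority label

def accuracy(rows):
  return sum(predict(x) == y for x, y in rows) / len(rows)

train_acc, val_acc = accuracy(toy_train), accuracy(toy_val)
print(train_acc, val_acc)
```

The memorizer is perfect on the rows it stored and close to a coin flip on the rest, which is why the validation score is the number to compare against the baseline.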

In [26]:
Player_A = input('please enter player A: ')
Player_B = input(' please enter player B: ')
print(Player_A, ' vs. ', Player_B)


please enter player A: Roger Federer
 please enter player B: Rafael Nadal
Roger Federer  vs.  Rafael Nadal


In [27]:
tennis2.head()

Unnamed: 0,Player A,Player A 1stIn,Player A 1stWon,Player A 2ndwon,Player A Entry,Player A Hand,Player A Height,Player A ID,Player A Name,Player A SvGms,Player A ace,Player A bpFaced,Player A bpSaved,Player A df,Player A ioc,Player A points,Player A rank,Player A seed,Player A svpt,Player B,Player B 1stIn,Player B 1stWon,Player B 2ndwon,Player B Entry,Player B Hand,Player B Height,Player B ID,Player B Name,Player B SvGms,Player B ace,Player B bpFaced,Player B bpSaved,Player B df,Player B ioc,Player B points,Player B rank,Player B seed,Player B svpt,Winner,best_of,draw_size,match_num,minutes,players,round,score,surface,tourney_date,tourney_id,tourney_level,tourney_name,winner_name,year
1,Ernie Mccabe,,,,,R,,106964,,,,,,,AUS,,,,,John Brown,,,,,R,,109803,27.520876,,,,,,AUS,,,,,Player B,5,64.0,2,,Player 2,R64,6-3 6-2 6-4,Grass,19680119,1968-580,G,Australian Chps.,John Brown,1968
3,Robert Layton,,,,,R,,110025,,,,,,,AUS,,,,,Allan Stone,,,,,R,,100105,22.264203,,,,,,AUS,,,5.0,,Player B,5,64.0,4,,Player 2,R64,6-4 6-2 6-1,Grass,19680119,1968-580,G,Australian Chps.,Allan Stone,1968
7,Tony Dawson,,,,,R,,108430,,,,,,,AUS,,,,,Barry Phillips Moore,,,,,R,173.0,100025,30.529774,,,,,,AUS,,,3.0,,Player B,5,64.0,8,,Player 2,R64,6-3 6-0 6-3,Grass,19680119,1968-580,G,Australian Chps.,Barry Phillips Moore,1968
8,Peter Oatey,,,,,R,,110029,,,,,,,AUS,,,,,William Coghlan,,,,,R,,108519,,,,,,,AUS,,,,,Player B,5,64.0,9,,Player 2,R64,6-0 6-2 9-11 6-3,Grass,19680119,1968-580,G,Australian Chps.,William Coghlan,1968
13,J May,,,,,R,,110035,,,,,,,AUS,,,,,Merv Guse,,,,,R,,110034,,,,,,,AUS,,,,,Player B,5,64.0,14,,Player 2,R64,6-1 6-2 6-2,Grass,19680119,1968-580,G,Australian Chps.,Merv Guse,1968
