# Task 1: Data Preparation

### Imports & format checks

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import necessary libraries

In [2]:
# Load the data from the file NBA_players_stats.csv using the pandas library.
NBA = pd.read_csv('NBA_players_stats.csv', index_col=0) # use the first column (Rk) as the index so it is not treated as a feature

In [3]:
NBA.columns

Index(['Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB',
       'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'],
      dtype='object')

In [4]:
NBA.dtypes # Get data type for each column & verify they are correct

Player     object
Pos        object
Age         int64
Tm         object
G           int64
GS          int64
MP          int64
FG          int64
FGA         int64
FG%       float64
3P          int64
3PA         int64
3P%       float64
2P          int64
2PA         int64
2P%       float64
FT          int64
FTA         int64
FT%       float64
ORB         int64
DRB         int64
TRB         int64
AST         int64
STL         int64
BLK         int64
TOV         int64
PF          int64
PTS         int64
dtype: object

In [5]:
NBA.head(3) # quick check on the first 3 rows, mainly for format

Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
1,Precious Achiuwa,PF,21,MIA,35,2,491,84,145,0.579,...,0.543,46,95,141,20,15,19,32,58,212
2,Jaylen Adams,PG,24,MIL,7,0,18,1,8,0.125,...,,0,3,3,2,0,0,0,1,20000
3,Steven Adams,C,27,NOP,33,33,918,115,187,0.615,...,0.443,133,161,294,69,29,20,50,63,265


### Checking Data for errors - Player

In [6]:
# isna() flags None and NaN; the equality test catches empty strings.
# NaN never compares equal to anything, so `== np.nan` cannot detect it.
missing = NBA.isna().any().any() or (NBA == '').any().any()
if missing:
    print("There are missing values")
else:
    print("There are no missing values")


The code above checks every cell for missing values, which would take a long time to do by inspection.
The preview from `NBA.head(3)` already shows an empty FT% for Jaylen Adams (row 2), so the check is expected to report missing values, and they need to be handled during cleaning.
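A per-column count is usually more useful than a single yes/no flag, because it shows where the gaps are. The following is a minimal sketch on a small stand-in frame; `toy` and its values are made up for illustration, and the same calls should apply to `NBA`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for NBA; the values are illustrative only.
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "FT%": [0.543, np.nan, 0.443],
    "PTS": [212, 20, 265],
})

# isna() marks None/NaN cells; sum() counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Rows that contain at least one missing value.
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```

Note that `isna()` does not treat empty strings as missing, so those would need a separate test if they can occur.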

In [7]:
duplicate_names = NBA.duplicated('Player')
NBA[['Player']][duplicate_names]

Rk,Player
9,Jarrett Allen
9,Jarrett Allen
182,James Harden
182,James Harden
238,Damian Jones
238,Damian Jones
255,Rodions Kurucs
255,Rodions Kurucs
263,Alex Len
263,Alex Len


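This code lists the rows whose player name has already appeared earlier in the dataset, together with the Rk index of each row. Because `duplicated` only flags the second and later occurrences, every player listed above has at least three rows in total. These repeated rows need to be dealt with when cleaning the data.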
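To see every row for a repeated player, including the first occurrence, `duplicated` can be called with `keep=False`. This is a sketch on a made-up frame: the player names, team codes and the idea that the rows differ by team are assumptions for illustration, not facts read from the dataset.

```python
import pandas as pd

# Toy frame; names, team codes and points are illustrative assumptions.
toy = pd.DataFrame(
    {
        "Player": ["P1", "P1", "P1", "P2"],
        "Tm": ["TOT", "AAA", "BBB", "CCC"],
        "PTS": [30, 10, 20, 15],
    },
    index=pd.Index([1, 1, 1, 2], name="Rk"),
)

# keep=False marks every row of a repeated player, including the first one.
all_repeats = toy[toy.duplicated("Player", keep=False)]
print(all_repeats)

# Without a subset, duplicated() only flags rows identical in every column.
exact_copies = toy[toy.duplicated()]
print(len(exact_copies))

# One possible cleaning choice: keep only the first row per player.
deduplicated = toy.drop_duplicates("Player", keep="first")
print(deduplicated)
```

Comparing `all_repeats` with `exact_copies` helps decide whether the repeats are true copies or distinct records that happen to share a name, which in turn determines whether dropping them is safe.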

### Checking Data for errors - Position

### Checking Data for errors - Age

In [8]:
# Use descriptive names so the built-in min() and max() are not overwritten
min_age = NBA["Age"].min()
max_age = NBA["Age"].max()
print("The youngest player is aged: ", min_age)
print("The oldest player is aged: ", max_age)

The youngest player is aged:  -19
The oldest player is aged:  280


In [9]:
print ("Players who are over 40: ")
NBA[['Age', 'Player']][NBA.Age > 40]

Players who are over 40: 


Rk,Age,Player
161,280,Anthony Gill


In [10]:
print ("Players who are younger than 18: ")
NBA[['Age', 'Player']][NBA.Age < 18]

Players who are younger than 18: 


Rk,Age,Player
194,-19,Killian Hayes


Upon checking the Age column, we see that the youngest player is listed as -19 and the oldest as 280. Neither is a possible age, so both are data entry errors that must be corrected during cleaning.
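One way to handle these two values is sketched below. It rests on two assumptions that should be verified against an external source before being applied: that the negative age is a sign typo, and that the age of 280 carries a stray trailing zero. The bounds `MIN_AGE` and `MAX_AGE` and the third row of `toy` are also illustrative choices, not part of the dataset.

```python
import pandas as pd

# Toy frame reproducing the two suspicious ages; the third row is made up.
toy = pd.DataFrame({
    "Player": ["Anthony Gill", "Killian Hayes", "Other Player"],
    "Age": [280, -19, 25],
})

MIN_AGE, MAX_AGE = 18, 45  # assumed plausible bounds for an active player

# Flag every age outside the plausible range.
out_of_range = ~toy["Age"].between(MIN_AGE, MAX_AGE)
print(toy[out_of_range])

cleaned = toy.copy()
# Assumption: a negative age is a sign typo.
cleaned["Age"] = cleaned["Age"].abs()
# Assumption: an age above the bound has a stray trailing zero.
too_old = cleaned["Age"] > MAX_AGE
cleaned.loc[too_old, "Age"] = cleaned.loc[too_old, "Age"] // 10
print(cleaned)
```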

In [11]:
"""
Make sure that you write the final cleaned dataframe object into a csv file
XXXXX.to_csv('cleaned_NBA_players_stats.csv', index=False)
Please replace XXXXX with your dataframe variable.
"""

### Checking Data for errors - Team

### Checking Data for errors - Games

### Checking Data for errors - Games Started

### Checking Data for errors - Minutes Played

### Checking Data for errors - Field Goals

### Checking Data for errors - Field Goal Attempts

### Checking Data for errors - Field Goal Percentage

### Checking Data for errors - 3-Point Field Goals

### Checking Data for errors - 3-Point Field Goal Attempts

### Checking Data for errors - 3-Point Field Goal Percentage

### Checking Data for errors - 2-Point Field Goals

### Checking Data for errors - 2-Point Field Goal Attempts

### Checking Data for errors - 2-Point Field Goal Percentage

### Checking Data for errors - Free Throws

### Checking Data for errors - Free Throw Attempts

### Checking Data for errors - Free Throw Percentage

### Checking Data for errors - Offensive Rebounds

### Checking Data for errors - Defensive Rebounds

### Checking Data for errors - Total Rebounds

### Checking Data for errors - Assists

### Checking Data for errors - Steals

### Checking Data for errors - Blocks

### Checking Data for errors - Turnovers

### Checking Data for errors - Personal Fouls

### Checking Data for errors - The Total Points

# Task 2: Data Exploration

## Task 2.1 
Explore the players' total points: Please analyze the composition of the total points of the top five players with the most points.

In [12]:
# Code goes after this line by adding cells
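A possible starting point for this task is to split each player's total into points from 2-point field goals, 3-point field goals and free throws, using the scoring rule PTS = 2 x 2P + 3 x 3P + FT. The sketch below uses a made-up frame (`toy` and all of its values are illustrative); on the real data the same steps would run on the cleaned dataframe.

```python
import pandas as pd

# Toy frame; all numbers are made up for illustration.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "2P": [200, 150, 300, 120, 80, 50],
    "3P": [100, 120, 10, 90, 60, 20],
    "FT": [150, 90, 200, 60, 40, 10],
})
toy["PTS"] = 2 * toy["2P"] + 3 * toy["3P"] + toy["FT"]

# Five players with the most points.
top5 = toy.nlargest(5, "PTS").set_index("Player")

# Points contributed by each shot type.
composition = pd.DataFrame({
    "from_2P": 2 * top5["2P"],
    "from_3P": 3 * top5["3P"],
    "from_FT": top5["FT"],
})

# Share of each player's total coming from each shot type.
shares = composition.div(top5["PTS"], axis=0)
print(shares.round(3))
```

A stacked bar chart, for example `composition.plot(kind="bar", stacked=True)`, is one way to show the result visually.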


## Task 2.2 
Assume that the data collector made entry errors while collecting the data. The errors are known to be in the 3P, 3PA and 3P% columns, but it is not known which players' rows are affected. Explore the data through visualization to identify how many errors there are, and try to fix them.


In [13]:
# Code goes after this line by adding cells
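The three columns constrain each other: a player cannot make more 3-pointers than were attempted, and 3P% should equal 3P / 3PA up to rounding. The sketch below plots 3P against 3PA and flags rows that break either rule. The frame `toy`, its values, and the tolerance of 0.001 (chosen on the assumption that 3P% is rounded to three decimals) are all illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy frame; values are made up. B and C contain deliberate errors.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "3P": [10, 50, 30, 0],
    "3PA": [30, 40, 90, 0],
    "3P%": [0.333, 1.250, 0.900, np.nan],
})

# Scatter plot: valid rows must lie on or below the line 3P = 3PA.
fig, ax = plt.subplots()
ax.scatter(toy["3PA"], toy["3P"])
limit = toy["3PA"].max()
ax.plot([0, limit], [0, limit], linestyle="--")
ax.set_xlabel("3PA")
ax.set_ylabel("3P")

# Rule 1: made shots cannot exceed attempts.
more_made_than_attempted = toy["3P"] > toy["3PA"]

# Rule 2: the stored percentage should match 3P / 3PA (0/0 gives NaN,
# which is never flagged by the comparison below).
expected_pct = toy["3P"] / toy["3PA"]
pct_mismatch = (toy["3P%"] - expected_pct).abs() > 0.001

suspicious = toy[more_made_than_attempted | pct_mismatch]
print(suspicious)
```

Which of the three columns is wrong in a flagged row cannot be decided from these rules alone, so any fix should state the assumption it relies on.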


## Task 2.3 
Please analyze the relationship between the players' total points and the remaining features (columns). Please use at least three other columns.


In [14]:
# Code goes after this line by adding cells
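A simple first look at these relationships is the Pearson correlation between PTS and each candidate column. The sketch below uses MP, FGA and AST as the three columns, which is an arbitrary choice, and a small frame whose values are partly made up.

```python
import pandas as pd

# Toy frame; values are illustrative, not the real dataset.
toy = pd.DataFrame({
    "PTS": [212, 20, 265, 500, 90],
    "MP": [491, 18, 918, 1200, 300],
    "FGA": [145, 8, 187, 400, 80],
    "AST": [20, 2, 69, 150, 10],
})

# Correlation of every column with PTS, strongest first.
correlations = toy.corr()["PTS"].drop("PTS").sort_values(ascending=False)
print(correlations)
```

Correlation only captures linear association, so scatter plots of PTS against each column are a useful complement before drawing conclusions.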
