# Data Preprocessing
In this section, we load the information about cryptocurrencies from the provided CSV file and perform some data preprocessing tasks. The data was retrieved from [CryptoCompare](https://min-api.cryptocompare.com/data/all/coinlist).  

In [1]:
# Import libraries
import pandas as pd

Start by loading the data in a Pandas DataFrame named **“crypto_df.”** Continue with the following data preprocessing tasks:  

- Remove all cryptocurrencies that aren’t trading.
- Remove all cryptocurrencies that don’t have an algorithm defined.
- Remove the IsTrading column.
- Remove all cryptocurrencies with at least one null value.
- Remove all cryptocurrencies without coins mined.
- Store the names of all cryptocurrencies on a DataFramed named coins_name, and use the crypto_df.index as the index for this new DataFrame.
- Remove the CoinName column.
- Create dummies variables for all of the text features, and store the resulting data on a DataFrame named X.
- Use the StandardScaler from sklearn to standardize all of the data from the X DataFrame. Remember, this is important prior to using PCA and K-means algorithms.

In [2]:
# Load the dataset in a Pandas DataFrame
file_path = "../Resources/crypto_data.csv"
crypto_df = pd.read_csv(file_path)
crypto_df.head()

Unnamed: 0.1,Unnamed: 0,CoinName,Algorithm,IsTrading,ProofType,TotalCoinsMined,TotalCoinSupply
0,42,42 Coin,Scrypt,True,PoW/PoS,41.99995,42
1,365,365Coin,X11,True,PoW/PoS,,2300000000
2,404,404Coin,Scrypt,True,PoW/PoS,1055185000.0,532000000
3,611,SixEleven,SHA-256,True,PoW,,611000
4,808,808,SHA-256,True,PoW/PoS,0.0,0


##  First, account for the data you have. 
### **What data is available?**

In [3]:
crypto_df.shape

(1252, 7)

First, account for the data you have. After all, you can’t extract knowledge without data. We can use the columns method and output the columns, as shown below:

In [4]:
# Columns 
crypto_df.columns

Index(['Unnamed: 0', 'CoinName', 'Algorithm', 'IsTrading', 'ProofType',
       'TotalCoinsMined', 'TotalCoinSupply'],
      dtype='object')

### **What type of data is available?**

Using the dtypes method, confirm the data type, which also will alert us if anything should be changedin the next step (e.g., converting text to numerical data). All the columns we plan to use in our model must contain a numerical data type:


In [5]:
# List dataframe data types
crypto_df.dtypes

Unnamed: 0          object
CoinName            object
Algorithm           object
IsTrading             bool
ProofType           object
TotalCoinsMined    float64
TotalCoinSupply     object
dtype: object

Unnamed, CoinName, Algorithm, IsTrading, ProofType, and TotalCoinSupply columns contains a data type object, which is not numerical.

### What data is missing?

See if any data is missing. Unsupervised learning models can’t handle missing data. If you try to run a model on a dataset with missing data, you’ll get an error.

In [6]:
# Find null values
for column in crypto_df.columns:
    print(f"Column {column} has {crypto_df[column].isnull().sum()} null values")

Column Unnamed: 0 has 0 null values
Column CoinName has 0 null values
Column Algorithm has 0 null values
Column IsTrading has 0 null values
Column ProofType has 0 null values
Column TotalCoinsMined has 508 null values
Column TotalCoinSupply has 0 null values


**Remove all cryptocurrencies that don’t have an algorithm defined.**

The algorithm column has 0 null values. All are defined. However, the TotalCoinsMined column has 508 null values.

### What data can be removed?

**Remove all cryptocurrencies with at least one null value.**  

Determine if the data can be removed. Consider: Are there string columns that we can’t use? Are there columns with excessive null data points? Was our decision to handle missing values to just remove them?
In our example, only a 508 rows have null data points, but not enough to remove a whole column’s worth. Rows of data with null values can be removed with the dropna() method.

In [7]:
# Drop null rows
crypto_df = crypto_df.dropna()
crypto_df.shape

(744, 7)

Duplicates can also be removed. Having duplicate rows means the data has already been recorded, so no new information can be obtained from them. Keeping the rows in the dataset could affect the results by giving those data points too much weight. Use the duplicated().sum() method to check for duplicates.

In [8]:
# Find duplicate entries
print(f"Duplicate entries: {crypto_df.duplicated().sum()}")

Duplicate entries: 0


**Remove all cryptocurrencies that aren’t trading.**  

In [9]:
# Transform String column
def change_string(IsTrading):
    if IsTrading == "False":
        return 0
    else:
        return 1
    
crypto_df["IsTrading"] = crypto_df["IsTrading"].apply(change_string)
crypto_df.shape

(744, 7)

In [10]:
crypto_df = crypto_df[-(crypto_df == 0).any(axis=1)]
crypto_df.shape

(578, 7)

**Remove the IsTrading column**  
Having all **"IsTrading"** rows means no new information can be obtained from them. We can drop the column.

In [11]:
# Remove the column
crypto_df.drop(columns=["IsTrading"], inplace=True)
crypto_df.head()

Unnamed: 0.1,Unnamed: 0,CoinName,Algorithm,ProofType,TotalCoinsMined,TotalCoinSupply
0,42,42 Coin,Scrypt,PoW/PoS,41.99995,42
2,404,404Coin,Scrypt,PoW/PoS,1055185000.0,532000000
5,1337,EliteCoin,X13,PoW/PoS,29279420000.0,314159265359
7,BTC,Bitcoin,SHA-256,PoW,17927180.0,21000000
8,ETH,Ethereum,Ethash,PoW,107684200.0,0


In [12]:
crypto_df.shape

(578, 6)

**Remove all cryptocurrencies without coins mined.**

In [13]:
pd.options.display.float_format = '{:,.0f}'.format

In [14]:
crypto_df = crypto_df[crypto_df.TotalCoinsMined > 0]

In [15]:
crypto_df.shape

(577, 6)

**Store the names of all cryptocurrencies on a DataFramed named coins_name, and use the crypto_df.index as the index for this new DataFrame.**  

**Remove the CoinName column.**

In [16]:
coins_name = crypto_df.set_index(["Unnamed: 0"])
coins_name

Unnamed: 0_level_0,CoinName,Algorithm,ProofType,TotalCoinsMined,TotalCoinSupply
Unnamed: 0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
42,42 Coin,Scrypt,PoW/PoS,42,42
404,404Coin,Scrypt,PoW/PoS,1055184902,532000000
1337,EliteCoin,X13,PoW/PoS,29279424623,314159265359
BTC,Bitcoin,SHA-256,PoW,17927175,21000000
ETH,Ethereum,Ethash,PoW,107684223,0
...,...,...,...,...,...
GAP,Gapcoin,Scrypt,PoW/PoS,14931046,250000000
BDX,Beldex,CryptoNight,PoW,980222595,1400222610
ZEN,Horizen,Equihash,PoW,7296538,21000000
XBC,BitcoinPlus,Scrypt,PoS,128327,1000000


In [17]:
# Remove the column
coins_name.drop(columns=["CoinName"], inplace=True)
coins_name

Unnamed: 0_level_0,Algorithm,ProofType,TotalCoinsMined,TotalCoinSupply
Unnamed: 0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
42,Scrypt,PoW/PoS,42,42
404,Scrypt,PoW/PoS,1055184902,532000000
1337,X13,PoW/PoS,29279424623,314159265359
BTC,SHA-256,PoW,17927175,21000000
ETH,Ethash,PoW,107684223,0
...,...,...,...,...
GAP,Scrypt,PoW/PoS,14931046,250000000
BDX,CryptoNight,PoW,980222595,1400222610
ZEN,Equihash,PoW,7296538,21000000
XBC,Scrypt,PoS,128327,1000000


### Is the data in a format that can be passed into an unsupervised learning model?

In [18]:
coins_name['TotalCoinSupply'] = coins_name['TotalCoinSupply'].apply(float)

We know that our model can’t have strings passed into it. To make sure we can use our string data, we’ll transform our strings of PoW/PoS, PoS, and PoW from the ProofType column to 0,1, and 2, respectively. The function will then be run on the whole column with the .apply method.

In [19]:
# Transform String column
def change_string(ProofType):
    if ProofType == "PoW/PoS":
        return 0
    if ProofType == "Pos":
        return 1
    else:
        return 2
    
coins_name["ProofType"] = coins_name["ProofType"].apply(change_string)
coins_name.head()

Unnamed: 0_level_0,Algorithm,ProofType,TotalCoinsMined,TotalCoinSupply
Unnamed: 0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
42,Scrypt,0,42,42
404,Scrypt,0,1055184902,532000000
1337,X13,0,29279424623,314159265359
BTC,SHA-256,2,17927175,21000000
ETH,Ethash,2,107684223,0


In [20]:
# Saving cleaned data
file_path = "../Resources/coins_name.csv"
coins_name.to_csv(file_path, index=False)

**Create dummies variables for all of the text features, and store the resulting data on a DataFrame named X.**

In [21]:
X = coins_name[['Algorithm', 'ProofType', 'TotalCoinsMined', 'TotalCoinSupply']].copy()
X = pd.get_dummies(X, columns=['Algorithm'], drop_first=True)
X = X.dropna()
X.head()

Unnamed: 0_level_0,ProofType,TotalCoinsMined,TotalCoinSupply,Algorithm_536,Algorithm_Argon2d,Algorithm_BLAKE256,Algorithm_Blake,Algorithm_Blake2S,Algorithm_Blake2b,Algorithm_C11,...,Algorithm_Tribus,Algorithm_VBFT,Algorithm_VeChainThor Authority,Algorithm_X11,Algorithm_X11GOST,Algorithm_X13,Algorithm_X14,Algorithm_X15,Algorithm_X16R,Algorithm_XEVAN
Unnamed: 0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
42,0,42,42,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
404,0,1055184902,532000000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1337,0,29279424623,314159265359,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
BTC,2,17927175,21000000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ETH,2,107684223,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
X.shape

(577, 75)

## Data Transformation

Data transformation involves thinking about the future. More times than not, there will be new data coming into your data storage (a place where raw data is stored before being touched), with many people working on different types of data analysis. We want to make sure that whoever wants to use the data in the future can do so.

### Can I quickly hand off this data for others to use?
Now that our data has been cleaned and processed, it is ready to be converted to a readable format for future use

In [23]:
# Saving cleaned data
file_path = "../Resources/X.csv"
X.to_csv(file_path, index=False)

**Use the StandardScaler from sklearn to standardize all of the data from the X DataFrame. Remember, this is important prior to using PCA and K-means algorithms.**

In [24]:
from sklearn.preprocessing import MinMaxScaler
X_scaled = MinMaxScaler().fit_transform(X)
X_scaled

array([[0.00000000e+00, 0.00000000e+00, 4.20000000e-11, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 1.06585544e-03, 5.32000000e-04, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 2.95755135e-02, 3.14159265e-01, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [1.00000000e+00, 7.37028150e-06, 2.10000000e-05, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.00000000e+00, 1.29582282e-07, 1.00000000e-06, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 2.17085015e-05, 1.00000000e-04, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])