<a href="https://colab.research.google.com/github/JuIsa/New-Crypto-Listings/blob/main/listing_pred.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load data

First I import some libraries to work with the raw data. All announcements were taken from the Bybit website and serialized with the `pickle` library.<br>
The file is saved on my Google Drive; fortunately, Colab has a built-in feature for working with files on Drive.

In [None]:
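import pickle

# Assumes Google Drive is already mounted at /content/drive.
# The context manager closes the file even if loading fails.
with open('/content/drive/MyDrive/bybit99.listing', 'rb') as file:
    data = pickle.load(file)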
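
For reference, a file like this could be produced and read back as follows. This is a minimal sketch, not the original scraping code: the titles and the temporary path are made up, and it only illustrates the `pickle.dump` / `pickle.load` round trip.

```python
import os
import pickle
import tempfile

# Hypothetical titles standing in for scraped announcements.
titles = ["New Listing: ABC/USDT", "", "Derivatives: maintenance notice"]

# Write to a temporary file instead of Google Drive (assumption for the demo).
path = os.path.join(tempfile.mkdtemp(), "demo.listing")
with open(path, "wb") as f:
    pickle.dump(titles, f)

# Reading the file back returns an equal Python object.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == titles)
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code while loading.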

In [None]:
import pandas as pd
import numpy as np

After loading the data, I can check what the news announcements look like.

In [None]:
df = pd.DataFrame(data)
df.head()

Unnamed: 0,0
0,Upcoming Changes to Interest-Free Loan Quota f...
1,"Bybit x ApeX Pro: Try ApeX Pro and Share 90,00..."
2,"Deposit with ZEN and Share the 1,000 USDT Priz..."
3,ByVotes Chapter 14: Vote for Your Favorite Pro...
4,Derivatives: Upgrade of Take Profit/Stop Loss ...


And see how many records there are in total:

In [None]:
df.shape

(882, 1)

The only column is named `0`, so I'm going to rename it to `text`.

In [None]:
df.rename(columns={0:'text'},inplace=True)  

In [None]:
df.head()

Unnamed: 0,text
0,Upcoming Changes to Interest-Free Loan Quota f...
1,"Bybit x ApeX Pro: Try ApeX Pro and Share 90,00..."
2,"Deposit with ZEN and Share the 1,000 USDT Priz..."
3,ByVotes Chapter 14: Vote for Your Favorite Pro...
4,Derivatives: Upgrade of Take Profit/Stop Loss ...


There are empty lines, so I get rid of them by checking the length of each line and keeping only the lines that have more than one character.

After filtering, only 784 news items remain.

In [None]:
df = df[df['text'].str.len() > 1].copy()  # copy so new columns can be added later without a SettingWithCopyWarning
df.shape

(784, 1)
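
The effect of the length filter is easier to see on a tiny example. This is a sketch on made-up rows, not the real data; `demo` and `kept` are hypothetical names.

```python
import pandas as pd

# Made-up rows: two real titles, one empty string, one single character.
demo = pd.DataFrame({"text": ["New Listing: ABC/USDT", "", "-", "Maintenance notice"]})

# Same condition as in the notebook: keep rows whose text is longer than one character.
kept = demo[demo["text"].str.len() > 1].copy()

print(len(demo), "->", len(kept))
```

Note that the condition `> 1` also drops one-character lines, not only empty ones.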

# Prepare data

Since the data isn't labeled for training, I create the labels myself. I know that listing news contains the name of a token and the word `USDT`, so my first mask marks lines that contain `USDT`. Listing news also contains the word `listing`, so my second mask marks lines that contain it (case-insensitively). <br>
If I know that these two words appear in listing news, why not just use `if` statements? Because some announcements are about listing a new trading pair for an existing token, or futures, rather than a new token, and I don't want those.


In [None]:
usdt = np.where(df['text'].str.count('USDT')>0,1,0)

In [None]:
listing = np.where(df['text'].str.lower().str.count('listing')>0,1,0)

In [None]:
df['tar_usdt']=usdt
df['tar_list']=listing
df.head()

Unnamed: 0,text,tar_usdt,tar_list
0,Upcoming Changes to Interest-Free Loan Quota f...,1,0
1,"Bybit x ApeX Pro: Try ApeX Pro and Share 90,00...",0,0
2,"Deposit with ZEN and Share the 1,000 USDT Priz...",1,0
3,ByVotes Chapter 14: Vote for Your Favorite Pro...,0,0
4,Derivatives: Upgrade of Take Profit/Stop Loss ...,0,0


Combine the 2 masks to get the final label column:

In [None]:
target = np.where((df['tar_usdt']==1) & (df['tar_list']==1), 1,0)

In [None]:
df['target']=target

In [None]:
df[df['target']==1].head()

Unnamed: 0,text,tar_usdt,tar_list,target
25,New Listing: TAMA/USDT — Grab a Share of the 4...,1,1,1
37,New Listing: MVL/USDT — Grab a Share of the 23...,1,1,1
72,New Listing: AGI/USDT — Share a Prize Pool Wor...,1,1,1
75,New Listing: CGPT/USDT — Grab a Share of the 1...,1,1,1
105,New Listing: BABYDOGE/USDT — Grab a Share of t...,1,1,1
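
The two masks and their combination can be sketched on a few made-up titles. This version uses `str.contains` instead of `str.count(...) > 0`; for this purpose the two should give the same mask, but that swap is mine, not the notebook's.

```python
import pandas as pd

# Made-up titles covering the four keyword combinations.
demo = pd.DataFrame({"text": [
    "New Listing: ABC/USDT, Grab a Share of the Prize Pool",  # both keywords
    "Deposit with XYZ and Share the 1,000 USDT Prize Pool",   # USDT only
    "New Listing: XYZUSD perpetual contract",                 # listing only
    "Scheduled system maintenance",                           # neither
]})

# First mask: the literal string 'USDT' (case-sensitive, as in the notebook).
has_usdt = demo["text"].str.contains("USDT", regex=False)
# Second mask: the word 'listing', matched case-insensitively via lower().
has_listing = demo["text"].str.lower().str.contains("listing", regex=False)

# A row is labeled 1 only when both masks are true.
demo["target"] = (has_usdt & has_listing).astype(int)

print(demo["target"].tolist())  # [1, 0, 0, 0]
```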


# Train model

Now it's time to train a model to recognize new listings.<br>
Import all the libraries:

In [None]:
from sklearn import svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

Vectorize the text with TF-IDF and divide the data into training and testing portions:

In [None]:
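x_data = df['text']
y_data = df['target']
# Despite its name, count_vector is a TF-IDF vectorizer; it is reused later on new titles.
count_vector = TfidfVectorizer()
extracted_features = count_vector.fit_transform(x_data)
# No random_state is set, so the split (and the accuracy below) varies between runs.
x_train, x_test, y_train, y_test = train_test_split(extracted_features, y_data, test_size=0.15)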

In [None]:
x_train

<666x1227 sparse matrix of type '<class 'numpy.float64'>'
	with 5848 stored elements in Compressed Sparse Row format>
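
The cell above fits TF-IDF on all rows before splitting, so the vocabulary and IDF weights also see the test titles. One common alternative is to split the raw text first and wrap the vectorizer and classifier in a `Pipeline`. The following is only a sketch on made-up titles; the linear kernel, `random_state`, and `test_size` are my assumptions, not the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up titles and labels, only to make the sketch runnable.
texts = [f"New Listing: TOKEN{i}/USDT prize pool" for i in range(10)] + \
        [f"Maintenance notice number {i}" for i in range(10)]
labels = [1] * 10 + [0] * 10

# Split the raw text first, so the test titles stay unseen during fitting.
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# The pipeline fits TF-IDF on the training titles only, then trains the SVM.
pipe = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
pipe.fit(x_train, y_train)

print(pipe.score(x_test, y_test))
```

A side benefit is that the fitted pipeline accepts raw strings directly, so there is no separate vectorizer to keep track of.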

In [None]:
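# gamma only affects the 'rbf' kernel; it is ignored for 'linear'.
tuned_parameters = {'kernel': ['rbf', 'linear'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]}
# GridSearchCV cross-validates every combination (5 folds by default) and refits the best one.
model = GridSearchCV(svm.SVC(probability=True), tuned_parameters)
model.fit(x_train, y_train)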

print("Model Trained Successfully!")

Model Trained Successfully!
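
After fitting, `GridSearchCV` exposes the winning combination as `best_params_` and its mean cross-validated score as `best_score_`. The sketch below shows this on made-up titles with a smaller grid and `cv=3`; both are assumptions chosen to keep the example fast, not the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Made-up titles and labels: 9 per class so that 3-fold stratified CV works.
texts = [f"New Listing: TOKEN{i}/USDT prize pool" for i in range(9)] + \
        [f"Maintenance notice number {i}" for i in range(9)]
labels = [1] * 9 + [0] * 9

features = TfidfVectorizer().fit_transform(texts)

# Reduced, hypothetical grid; the notebook's grid also includes gamma.
grid = {"kernel": ["rbf", "linear"], "C": [1, 10]}
search = GridSearchCV(SVC(), grid, cv=3)
search.fit(features, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```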


In [None]:
print("Accuracy of the model is: ",model.score(x_test,y_test)*100)


Accuracy of the model is:  96.61016949152543


An accuracy of 96.61% is enough for a task of this kind and scale. Later I can add new announcements to have more training data.
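
It helps to translate the percentage back into counts. The arithmetic below uses the 784 rows and `test_size=0.15` from above; the figure of 114 correct predictions is my inference from the reported accuracy, not something the notebook prints.

```python
import math

n_rows = 784      # rows left after dropping empty lines
test_size = 0.15  # fraction passed to train_test_split

# train_test_split rounds the test share up and gives the rest to training.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

# Assumption: 114 correct predictions, inferred from the reported accuracy.
n_correct = 114
accuracy = round(100 * n_correct / n_test, 2)

print(n_train, n_test, accuracy)  # 666 118 96.61
```

With 118 test rows, each misclassified title costs roughly 0.85 percentage points, so this score corresponds to about 4 errors. Because listings are likely a minority of all announcements, precision and recall on the listing class may be worth checking alongside accuracy.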

# Save model

To reuse this model in the future for actual projects I can save it using `pickle` again.

In [None]:
import pickle


In [None]:
# The context manager flushes and closes the file after writing.
with open('/content/drive/MyDrive/bybit.model', 'wb') as fp:
    pickle.dump(model, fp)
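
The saved model expects TF-IDF features, so reusing it in another session also requires the fitted vectorizer. One option, sketched here with made-up data and a temporary path, is to pickle both objects together in a dictionary. The keys `"vectorizer"` and `"model"` and the name `bundle` are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Made-up training data, only to have fitted objects to save.
texts = ["New Listing: ABC/USDT prize pool", "New Listing: DEF/USDT prize pool",
         "Maintenance notice", "Interest rate update"]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
clf = SVC(kernel="linear").fit(vectorizer.fit_transform(texts), labels)

# Save the vectorizer and the model side by side in one file.
path = os.path.join(tempfile.mkdtemp(), "demo.model")
with open(path, "wb") as f:
    pickle.dump({"vectorizer": vectorizer, "model": clf}, f)

# Later, possibly in another session: load both and classify a new title.
with open(path, "rb") as f:
    bundle = pickle.load(f)

new_features = bundle["vectorizer"].transform(["New Listing: XYZ/USDT"])
print(bundle["model"].predict(new_features))
```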

# Test model with new data

Test some announcements that are not in the initial dataset to see how the model recognizes new listings.

In [None]:
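# Note: this path differs from the bybit.model file written above; it should point
# to a pickled model trained on features from the same count_vector.
with open('/content/drive/MyDrive/ptmodel.pt', 'rb') as file:
    prediction = pickle.load(file)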

In [None]:
input_data = ['Upcoming Changes to Interest-Free Loan Quota for USDT and USDC Assets (UTA/UMA)','New Listing: KARATE/USDT — Grab a Share of the 100,000 USDT Prize Pool!']
input_data = count_vector.transform(input_data)


In [None]:
x = prediction.predict(input_data)
for i, pred in enumerate(x):
  if pred==0:
    print('input #',i+1,'is not listing')
  else:
    print('input #',i+1,'is listing')

input # 1 is not listing
input # 2 is listing


And it works! You can read the 2 news titles in `input_data` to see that the second one is indeed a new listing announcement.