# Data Science Semester Project - Classification of Scientific Articles from arXiv

### Step 1 - Data Collection

**Installing Libraries**

In [None]:
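!pip install arxiv
!pip install pandas
!pip install scikit-learn  # The PyPI package is "scikit-learn"; "sklearn" is a deprecated alias
!pip install tensorflow  # To use pre-trained models
!pip install tensorflow-hub  # Needed for `import tensorflow_hub`
!pip install streamlit  # To build the app (installed here only for local testing)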


**Data Collection**

In [None]:
import arxiv
import pandas as pd

# Categories to collect (example: AI and machine learning)
categories = ["cs.AI", "stat.ML", "cs.LG", "cs.CR"]
num_articles = 200  # Number of articles per category

# A single client is shared by all queries
client = arxiv.Client()

# Fetch the most recently submitted articles from one category
def fetch_arxiv_data(category, num_articles):
    search = arxiv.Search(
        query=f"cat:{category}",
        max_results=num_articles,
        sort_by=arxiv.SortCriterion.SubmittedDate
    )
    data = []
    # Client.results() replaces the deprecated Search.results()
    for result in client.results(search):
        data.append({
            'title': result.title,
            'summary': result.summary,
            'category': category
        })
    return pd.DataFrame(data)

# Collect data from all categories
df_list = [fetch_arxiv_data(cat, num_articles) for cat in categories]
df = pd.concat(df_list, ignore_index=True)

# Save the data to a CSV file (optional)
df.to_csv("arxiv_data.csv", index=False)
print(f"Collected {len(df)} articles.")
df.head()


Collected 1600 articles.


| | title | summary | category |
|---|---|---|---|
| 0 | Scaling Properties of Diffusion Models for Per... | In this paper, we argue that iterative computa... | cs.AI |
| 1 | GaussianAnything: Interactive Point Cloud Late... | While 3D content generation has advanced signi... | cs.AI |
| 2 | Learning with Less: Knowledge Distillation fro... | In real-world NLP applications, Large Language... | cs.AI |
| 3 | LLMPhy: Complex Physical Reasoning Using Large... | Physical reasoning is an important skill neede... | cs.AI |
| 4 | Leonardo vindicated: Pythagorean trees for min... | Trees continue to fascinate with their natural... | cs.AI |
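
Each article is labeled with the category that was queried, so a paper cross-listed in several arXiv categories can show up more than once with different labels. The sketch below shows one way to spot and drop such repeats. The `sample` DataFrame is a hypothetical stand-in with the same columns as `df`, and keeping the first occurrence is only one possible policy, not something the notebook does.

```python
import pandas as pd

# Hypothetical stand-in for the collected DataFrame (same columns as `df`)
sample = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper A", "Paper C"],
    "summary": ["text a", "text b", "text a", "text c"],
    "category": ["cs.AI", "cs.AI", "cs.LG", "stat.ML"],
})

# Rows whose title appears more than once (possibly under different labels)
dup_mask = sample.duplicated(subset="title", keep=False)
print(sample[dup_mask])

# One possible policy: keep only the first occurrence of each title
deduped = sample.drop_duplicates(subset="title", keep="first").reset_index(drop=True)

# Articles per category after deduplication
print(deduped["category"].value_counts())
```

Checking `value_counts()` after this step also shows whether the classes are still roughly balanced.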


**Data Preprocessing**

In [None]:
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Text cleaning
def clean_text(text):
    text = re.sub(r'\W', ' ', text)   # Replace special characters with spaces
    text = re.sub(r'\s+', ' ', text)  # Collapse the resulting runs of whitespace
    text = text.strip().lower()       # Trim and convert to lowercase
    return text

# Clean the summaries
df['cleaned_summary'] = df['summary'].apply(clean_text)

# Encode the class labels as integers
label_encoder = LabelEncoder()
df['category_encoded'] = label_encoder.fit_transform(df['category'])

# Separate features (X) and target (y)
X = df['cleaned_summary']
y = df['category_encoded']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Preprocessing complete.")


Preprocessing complete.
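
The split above is purely random. If the categories end up unevenly represented, passing `stratify` to `train_test_split` keeps each class's share the same in the training and test sets. The following is a minimal sketch on made-up labels; `labels` and `texts` are hypothetical stand-ins for `y` and `X`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels standing in for `y` (4 classes)
labels = np.array([0] * 400 + [1] * 300 + [2] * 200 + [3] * 100)
texts = np.array([f"abstract {i}" for i in range(len(labels))])

# stratify=labels preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Class counts in the training and test splits
print(np.bincount(y_tr))
print(np.bincount(y_te))
```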


### Step 2 - ML Processing

**Feature Extraction with Embeddings**

In [None]:
import tensorflow_hub as hub
import numpy as np

# Load the Universal Sentence Encoder to generate embeddings
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Convert texts to embeddings, encoding them in batches instead of one call per text
def get_embeddings(texts, batch_size=64):
    texts = list(texts)
    batches = [
        embed(texts[i:i + batch_size]).numpy()
        for i in range(0, len(texts), batch_size)
    ]
    return np.vstack(batches)

# Convert the training and test texts to embeddings
X_train_embeddings = get_embeddings(X_train)
X_test_embeddings = get_embeddings(X_test)

print("Embeddings created successfully.")


Embeddings created successfully.
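
The Universal Sentence Encoder maps each text to a 512-dimensional vector, and texts with similar meaning tend to have a high cosine similarity. As a quick sanity check on the embeddings, a pairwise similarity matrix can be computed with NumPy alone. The sketch below uses random vectors as a hypothetical stand-in for a few rows of `X_train_embeddings`; `cosine_similarity_matrix` is an illustrative helper, not part of the notebook.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms  # Scale every row to unit length
    return unit @ unit.T

# Hypothetical stand-in for three rows of X_train_embeddings (512-dimensional)
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(3, 512))

sim = cosine_similarity_matrix(fake_embeddings)
print(sim.shape)
print(np.round(sim, 3))
```

With real embeddings, abstracts from the same category would be expected to score higher against each other than against abstracts from unrelated categories.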


**Model Training**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train the model
clf = RandomForestClassifier()
clf.fit(X_train_embeddings, y_train)

# Evaluate the model
y_pred = clf.predict(X_test_embeddings)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model accuracy: {accuracy * 100:.2f}%")


Model accuracy: 80.31%


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train a decision tree for comparison; the separate name keeps `clf` bound to the random forest
dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train_embeddings, y_train)

# Evaluate the model
y_pred = dt_clf.predict(X_test_embeddings)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model accuracy: {accuracy * 100:.2f}%")


Model accuracy: 69.69%
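
Accuracy alone does not show which categories are being confused with each other. A per-class report and a confusion matrix give that breakdown. The sketch below runs on synthetic 4-class data from `make_classification`, used here as a stand-in for the embeddings and encoded labels, so its numbers say nothing about the arXiv models above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the embeddings and encoded labels
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=10, n_classes=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model = RandomForestClassifier(random_state=0)
model.fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Precision, recall and F1 for each class
print(classification_report(y_te, y_hat))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat, labels=[0, 1, 2, 3])
print(cm)
```

In the notebook, the same two calls could be applied to `y_test` and `y_pred`, with `label_encoder.classes_` supplying readable category names.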


### Step 3

**Creating the Streamlit App**

In [None]:
import joblib

# Save the trained model to disk so the Streamlit app can load it
joblib.dump(clf, 'modelo_treinado.pkl')

['modelo_treinado.pkl']
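
The saved model predicts integer class indices, so an app that loads it would also need the fitted label encoder and the same embedding step to return a category name. The sketch below illustrates that round trip under loud assumptions: `fake_embed` is a hypothetical stand-in for the Universal Sentence Encoder, the training texts are made up, and the file names `demo_model.pkl` and `demo_label_encoder.pkl` are illustrative only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the sentence encoder: maps each text to a small numeric vector
def fake_embed(texts):
    return np.array(
        [[len(t), t.count(" "), sum(map(ord, t)) % 97] for t in texts], dtype=float
    )

# Made-up training data standing in for the notebook's abstracts and categories
train_texts = ["neural network training", "bayesian inference",
               "malware detection", "deep learning model"]
train_labels = ["cs.LG", "stat.ML", "cs.CR", "cs.LG"]

encoder = LabelEncoder()
y_enc = encoder.fit_transform(train_labels)
model = RandomForestClassifier(random_state=0).fit(fake_embed(train_texts), y_enc)

# Persist both artifacts; the encoder is needed to decode predictions later
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "demo_model.pkl")
encoder_path = os.path.join(out_dir, "demo_label_encoder.pkl")
joblib.dump(model, model_path)
joblib.dump(encoder, encoder_path)

def predict_category(text, model, encoder, embed_fn):
    """Embed one text, predict its class index, and decode it to a category name."""
    class_index = model.predict(embed_fn([text]))
    return encoder.inverse_transform(class_index)[0]

# What an app would do at startup and on each request
loaded_model = joblib.load(model_path)
loaded_encoder = joblib.load(encoder_path)
prediction = predict_category("a new abstract", loaded_model, loaded_encoder, fake_embed)
print(prediction)
```

In a Streamlit script, the input text could come from `st.text_area` and the result be displayed with `st.write`, with `embed_fn` wrapping the real encoder and the same text cleaning applied before embedding.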