### OCI Data Science - Useful Tips
<details>
<summary><font size="2">Check for Public Internet Access</font></summary>

```python
import requests
# A timeout keeps the cell from hanging indefinitely when there is no route out
response = requests.get("https://oracle.com", timeout=10)
assert response.status_code == 200, "Internet connection failed"
```
</details>
<details>
<summary><font size="2">Helpful Documentation</font></summary>
<ul><li><a href="https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm">Data Science Service Documentation</a></li>
<li><a href="https://docs.cloud.oracle.com/iaas/tools/ads-sdk/latest/index.html">ADS documentation</a></li>
</ul>
</details>
<details>
<summary><font size="2">Typical Cell Imports and Settings for ADS</font></summary>

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

import ads
from ads.dataset.factory import DatasetFactory
from ads.automl.provider import OracleAutoMLProvider
from ads.automl.driver import AutoML
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import ADSData
from ads.explanations.explainer import ADSExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
from ads.catalog.model import ModelCatalog
from ads.common.model_artifact import ModelArtifact
```
</details>
<details>
<summary><font size="2">Useful Environment Variables</font></summary>

```python
import os
print(os.environ["NB_SESSION_COMPARTMENT_OCID"])
print(os.environ["PROJECT_OCID"])
print(os.environ["USER_OCID"])
print(os.environ["TENANCY_OCID"])
print(os.environ["NB_REGION"])
```
</details>

In [1]:
import pandas as pd

In [2]:
# Paths to your CSV files
file_paths = [
    'Inputs/UGOF.UL_INVOICE_PROCESS_REPORT_V1 (3).csv',
    'Inputs/UGOF.UL_INVOICE_PROCESS_REPORT_V1 (4).csv',
    'Inputs/UGOF.UL_INVOICE_PROCESS_REPORT_V1 (5).csv',
    'Inputs/UGOF.UL_INVOICE_PROCESS_REPORT_V1 (6).csv',
]

# Load each file into a DataFrame and store them in a list
dataframes = [pd.read_csv(file) for file in file_paths]

# Combine all DataFrames into one
combined_df = pd.concat(dataframes, ignore_index=True)


In [3]:
combined_df.shape

(165326, 79)

In [7]:
# Count missing values per column (the result is a Series, so no transpose is needed)
combined_df.isnull().sum()

P_INV_START_DATE            0
P_INV_END_DATE              0
INVOICE_ID                  2
INV                         0
LANE_ID                    29
                        ...  
INV_COMMENTS           163828
L_INV_LINE_COMMENTS    165306
CONTRACT_NO            100773
INV_REASON_CODE        165177
INV_CATEGORY           165184
Length: 79, dtype: int64
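
Several of the columns listed above are almost entirely empty (for example, `INV_COMMENTS` is missing in 163828 of 165326 rows). A minimal sketch of how such columns could be surfaced as a fraction of rows is shown below. The helper name `missing_report`, the threshold values, and the tiny demo frame are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import pandas as pd


def missing_report(df: pd.DataFrame, threshold: float = 0.95) -> pd.Series:
    """Return the missing-value fraction of every column at or above `threshold`."""
    fractions = df.isnull().mean()  # mean of a boolean mask = share of missing rows
    return fractions[fractions >= threshold].sort_values(ascending=False)


# Tiny illustrative frame with made-up values (not the invoice data)
demo = pd.DataFrame({
    "INVOICE_ID": [1, 2, 3, 4],
    "INV_COMMENTS": [None, None, None, "late"],
    "INV_CATEGORY": [None, None, None, None],
})

# Hypothetical cutoff of 75% for the demo frame
sparse = missing_report(demo, threshold=0.75)
print(sparse)
```

On the real data, the same call would be `missing_report(combined_df)`. Whether mostly empty columns should be dropped or kept depends on how they are used downstream.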

In [8]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for validation; the remaining 80% is kept for training and testing
train_test_df, validation_df = train_test_split(combined_df, test_size=0.2, random_state=42)
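
The split above leaves training and testing rows together in one frame. If a separate test set is needed later, a second `train_test_split` call could divide that frame again. The sketch below assumes a 75/25 ratio and uses a made-up stand-in frame; the ratio and the names `demo_train_test`, `demo_train`, and `demo_test` are hypothetical and not taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for train_test_df with made-up rows (not the invoice data)
demo_train_test = pd.DataFrame({"x": range(100)})

# Hypothetical second split: 75% of the remaining rows for training, 25% for testing
demo_train, demo_test = train_test_split(
    demo_train_test, test_size=0.25, random_state=42
)

print(len(demo_train), len(demo_test))
```

Fixing `random_state` keeps the partition reproducible across runs, which matches the seed used for the validation split.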

In [9]:
# Save the combined train/test set and the validation set as CSV files
train_test_df.to_csv('train_test.csv', index=False)
validation_df.to_csv('validation.csv', index=False)