Virus analysis tools should use local heuristical analysis/sandboxes plus artificial CNS #1206

Closed
ETERNALBLUEbullrun opened this issue Mar 19, 2024 · 19 comments

ETERNALBLUEbullrun commented Mar 19, 2024

Repurposed from https://swudususuwu.substack.com/p/howto-produce-better-virus-scanners ("Allows all uses")
Static analysis + sandbox + CNS takes approximately 1 second to analyze a new executable (this protects all app launches), but caches reduce repeat analyses to less than 1ms: the sole cost is a lookup in ResultList::hashes, which is std::unordered_set<decltype(Sha2(const FileBytecode &))> (a hash set of SHA-2 digests).
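
The cached lookup described above can be sketched as follows. This is a minimal illustration, not the repository's code: `Verdict`, `HashLists`, and `lookupVerdict` are hypothetical names, and digests are assumed to be stored as raw `std::string` values.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>

/* Hypothetical names, for illustration only. */
enum class Verdict { Abort, Pass, Continue };

struct HashLists {
	std::unordered_set<std::string> pass, abort; /* digests of reviewed files */
};

/* Average O(1): one lookup in the memo, else at most two set lookups. */
Verdict lookupVerdict(const HashLists &lists, std::unordered_map<std::string, Verdict> &memo, const std::string &digest) {
	assert(!digest.empty());
	const auto cached = memo.find(digest);
	if(memo.cend() != cached) {
		return cached->second;
	}
	Verdict verdict = Verdict::Continue; /* unknown file: run the slower analyses */
	if(lists.abort.count(digest)) {
		verdict = Verdict::Abort;
	} else if(lists.pass.count(digest)) {
		verdict = Verdict::Pass;
	}
	memo[digest] = verdict;
	return verdict;
}
```

The sketch checks the abort set first, so a digest which is (by mistake) in both sets is not launched; this order is a choice of the sketch.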

/* Licenses: allows all uses ("Creative Commons"/"Apache 2") */

[Version of post is Reduce 5183071 to envpS.empty() ? execv : execve · SwuduSusuwu/SubStack@f2b58d5] For the newest sources (plus static libs), use apps such as iSH (for iOS) or Termux (for Android OS) to run this:
git clone https://github.com/SwuduSusuwu/SubStack.git && cd ./SubStack/ && ./build.sh
less cxx/ClassPortableExecutable.hxx

typedef std::string FilePath; /* TODO: `std::char_traits<unsigned char>`, `std::basic_string<unsigned char>("string literal")` */
typedef FilePath FileBytecode; /* Uses `std::string` for bytecode (versus `std::vector`) because:
 * "If you are going to use the data in a string like fashon then you should opt for std::string as using a std::vector may confuse subsequent maintainers. If on the other hand most of the data manipulation looks like plain maths or vector like then a std::vector is more appropriate." -- https://stackoverflow.com/a/1556294/24473928
*/
typedef FilePath FileHash; /* TODO: `std::unordered_set<std::basic_string<unsigned char>>` */
typedef class PortableExecutable {
/* TODO: union of actual Portable Executable (Microsoft) + ELF (Linux) specifications */
public:
	FilePath path; /* Suchas "C:\Program.exe" or "/usr/bin/library.so" */
	FileBytecode bytecode; /* compiled programs; bytecode */
	std::string hex; /* `hexdump(path)`, hexadecimal, for C string functions */
} PortableExecutable;

less cxx/ClassSha2.cxx

/* Uses https://www.rfc-editor.org/rfc/rfc6234#section-8.2.2 */
/* const */ FileHash /* 256 bits, not null-terminated */ Sha2(const FileBytecode &bytecode) {
	FileHash result;
	SHA256Context context;
	result.resize(SHA256HashSize); /* `reserve()` does not change `size()`; with `reserve()`, `result` returns empty */
	SHA256Reset(&context);
	SHA256Input(&context, reinterpret_cast<const unsigned char *>(bytecode.data()), bytecode.size());
	SHA256Result(&context, reinterpret_cast<unsigned char *>(&result[0]));
	return result;
}
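
Since the digest is 256 raw bits (not null-terminated, and can contain `'\0'`), logs or databases often need a printable form. A minimal sketch of such a conversion follows; `digestToHex` is a hypothetical helper, not a function of the repository.

```cpp
#include <cassert>
#include <string>

/* Converts raw digest bytes (which may contain '\0') to lowercase hexadecimal. */
std::string digestToHex(const std::string &digest) {
	static const char digits[] = "0123456789abcdef";
	std::string hex;
	hex.reserve(digest.size() * 2);
	for(const char ch : digest) {
		const unsigned char byte = static_cast<unsigned char>(ch);
		hex.push_back(digits[byte >> 4]);   /* high 4 bits */
		hex.push_back(digits[byte & 0x0F]); /* low 4 bits */
	}
	assert(hex.size() == digest.size() * 2);
	return hex;
}
```

The cast to `unsigned char` matters: with a signed `char`, bytes above `0x7F` would produce negative indexes.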

less cxx/ClassResultList.hxx

typedef FileHash ResultListHash;
typedef FileBytecode ResultListBytecode; /* Should have structure of FileBytecode, but is not just for files, can use for UTF8/webpages, so have a new type for this */
typedef FilePath ResultListSignature; /* TODO: `typedef ResultListBytecode ResultListSignature; ResultListSignature("string literal");` */
typedef struct ResultList { /* Lists of files (or pages) */
	std::unordered_set<ResultListHash> hashes; /* Unique checksums of files (or pages), to avoid duplicates, plus to do fast checks for existence */
	std::vector<ResultListSignature> signatures; /* Smallest substrings (or regexes, or Universal Resource Locator) unique to this, has uses close to `hashes` but can match if files have small differences */
	std::vector<ResultListBytecode> bytecodes; /* Whole files (or webpages); uses lots of space, just populate this for signature synthesis (or training CNS). */
} ResultList;

template<class List>
const size_t listMaxSize(const List &list) {
#if PREFERENCE_IS_CSTR
	size_t max = 0;
	for(auto it = list.cbegin(); list.cend() != it; ++it) { const size_t temp = strlen(&(*it)[0]); if(temp > max) {max = temp;}}
	return max; /* WARNING! `strlen()` just does UTF8-strings/hex-strings; if binary, must use `it->size()` */
#else /* else !PREFERENCE_IS_CSTR */
	auto it = std::max_element(list.cbegin(), list.cend(), [](const auto &s, const auto &x) { return s.size() < x.size(); });
	return (list.cend() != it) ? it->size() : 0; /* `std::max_element()` returns `cend()` if `list.empty()` */
#endif /* PREFERENCE_IS_CSTR else */
}

/* @pre @code std::is_sorted(list.cbegin(), list.cend()) && std::is_sorted(list2.cbegin(), list2.cend()) @endcode */
template<class List>
const List listIntersections(const List &list, const List &list2) {
	List intersections;
	std::set_intersection(list.cbegin(), list.cend(), list2.cbegin(), list2.cend(), std::back_inserter(intersections));
	return intersections;
}
template<class List>
const bool listsIntersect(const List &list, const List &list2) {
	return listIntersections(list, list2).size();
}
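
The `@pre` of `listIntersections()` matters: `std::set_intersection` gives unspecified results on unsorted ranges. A minimal sketch which enforces the precondition itself (through sorted copies) follows; `sortedIntersection` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

/* Hypothetical helper: sorts copies of both lists, then intersects them. */
std::vector<std::string> sortedIntersection(std::vector<std::string> a, std::vector<std::string> b) {
	std::sort(a.begin(), a.end());
	std::sort(b.begin(), b.end());
	assert(std::is_sorted(a.cbegin(), a.cend()) && std::is_sorted(b.cbegin(), b.cend()));
	std::vector<std::string> common;
	std::set_intersection(a.cbegin(), a.cend(), b.cbegin(), b.cend(), std::back_inserter(common));
	return common;
}
```

The copies cost extra memory; callers which hold lists that are sorted once (as `staticAnalysis()` does) avoid this cost.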

template<class List>
auto listFindValue(const List &list, const typename List::value_type &x) {
	return std::find(list.cbegin(), list.cend(), x);
}
template<class List>
const bool listHasValue(const List &list, const typename List::value_type &x) {
	return list.cend() != listFindValue(list, x);
}
template<class List>
/* @pre @code s < x @endcode */
auto listFindSubstr(const List &list, typename List::value_type::const_iterator s, typename List::value_type::const_iterator x) {
#pragma unroll
	for(const auto &value : list) {
		auto result = std::search(value.cbegin(), value.cend(), s, x, [](char ch1, char ch2) { return ch1 == ch2; });
		if(value.cend() != result) {
			return result;
		}
	}
	return list.back().cend();
}
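template<class List>
/* @pre @code s < x @endcode */
const bool listHasSubstr(const List &list, typename List::value_type::const_iterator s, typename List::value_type::const_iterator x) {
	for(const auto &value : list) { /* does not use `list.back()`, which is undefined if `list.empty()` */
		if(value.cend() != std::search(value.cbegin(), value.cend(), s, x)) {
			return true;
		}
	}
	return false;
}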
template<class List>
/* Usage: auto tuple = listProduceUniqueSubstr(passList.bytecodes, bytecode); abortList.signatures.push_back(ResultListSignature(std::get<0>(tuple), std::get<1>(tuple))); */
const std::tuple<typename List::value_type::const_iterator, typename List::value_type::const_iterator> listProduceUniqueSubstr(const List &list, const typename List::value_type &value) {
	size_t smallest = value.size();
	auto retBegin = value.cbegin(), retEnd = value.cend();
	for(auto s = retBegin; value.cend() != s; ++s) {
		for(auto x = value.cend(); s != x; --x) {
			if(static_cast<size_t>(x - s) < smallest) { /* `x - s` is signed, `smallest` is unsigned */
				if(listHasSubstr(list, s, x)) {
					break;
				}
				smallest = x - s;
				retBegin = s, retEnd = x;
			}
		}
	} /* Incremental `for()` loops, is a slow method to produce unique substrings; should use binary searches, or look for the standard function which optimizes this. */
	return {retBegin, retEnd};
}
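
The goal of `listProduceUniqueSubstr()` (the smallest substring of an infected file which no passed file contains) can be restated as a brute-force search in order of length. This is a sketch on toy data, not the function above: `shortestUniqueSubstr` is a hypothetical name, and it returns a copy (not iterators).

```cpp
#include <cassert>
#include <string>
#include <vector>

/* Returns the shortest substring of `value` absent from every string in `passes`
 * (shortest first, then leftmost), or "" if all substrings occur in `passes`. */
std::string shortestUniqueSubstr(const std::vector<std::string> &passes, const std::string &value) {
	for(std::size_t length = 1; length <= value.size(); ++length) {
		for(std::size_t start = 0; start + length <= value.size(); ++start) {
			const std::string candidate = value.substr(start, length);
			bool found = false;
			for(const auto &pass : passes) {
				if(std::string::npos != pass.find(candidate)) {
					found = true;
					break;
				}
			}
			if(!found) {
				return candidate; /* first miss at this length is a smallest signature */
			}
		}
	}
	return std::string();
}
```

With `passes = {"SW", "newSW"}`, the value `"infectedSW"` yields `"i"`, since no passed file has this byte. Signatures this short are prone to false positives on files outside of the lists, so production use would want a minimum length.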
template<class List>
/* Usage: auto it = listOfSubstrFindMatch(resultList.signatures, bytecode); if(bytecode.cend() != it) {std::cout << "value matches ResultList.signatures at offset " << (it - bytecode.cbegin());}
 * Returns iterator into `x` (start of first match), or `x.cend()` if no signature matches. */
auto listOfSubstrFindMatch(const List &list, const typename List::value_type &x) {
	for(const auto &value : list) {
#if PREFERENCE_IS_CSTR
		const void *match = memmem(&x[0], strlen(&x[0]), &value[0], strlen(&value[0]));
		auto result = (NULL != match) ? x.cbegin() + (static_cast<const char *>(match) - &x[0]) : x.cend();
#else /* !PREFERENCE_IS_CSTR */
		auto result = std::search(x.cbegin(), x.cend(), value.cbegin(), value.cend(), [](char ch1, char ch2) { return ch1 == ch2; });
#endif /* !PREFERENCE_IS_CSTR */
		if(x.cend() != result) { /* `result` is an iterator of `x` (the haystack), not of `value` */
			return result;
		}
	}
	return x.cend();
}
template<class List>
/* Usage: if(listOfSubstrHasMatch(resultList.signatures, bytecode)) {std::cout << "value matches ResultList.signatures";} */
const bool listOfSubstrHasMatch(const List &list, const typename List::value_type &x) {
	return x.cend() != listOfSubstrFindMatch(list, x);
}

template<class S>
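const std::vector<S> explodeToList(const S &s, const S &token) {
	std::vector<S> list;
	if(token.empty()) { /* an empty token matches at each position, without progress */
		list.push_back(s);
		return list;
	}
	for(auto x = s.cbegin(); s.cend() != x; ) {
		auto it = std::search(x, s.cend(), token.cbegin(), token.cend(), [](char ch1, char ch2) { return ch1 == ch2; });
		list.push_back(S(x, it));
		if(s.cend() == it) { /* no more tokens */
			return list;
		}
		x = it + token.size(); /* skip the token, or the next search matches it again */
	}
	return list;
}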

less cxx/ClassCns.hxx

typedef enum CnsMode : char {
	cnsModeBool, cnsModeChar, cnsModeInt, cnsModeUint, cnsModeFloat, cnsModeDouble,
	cnsModeVectorBool, cnsModeVectorChar, cnsModeVectorInt, cnsModeVectorUint, cnsModeVectorFloat, cnsModeVectorDouble,
#ifdef CXX_17
	cnsModeString = cnsModeVectorChar /* std::string == std::vector<char> */
#else /* else !def CXX_17 */
/* https://stackoverflow.com/questions/5115166/how-to-construct-a-stdstring-from-a-stdvectorchar */
	cnsModeString
#endif /* def CXX_17 else */
} CnsMode;

/* `argv = argvS + NULL; envp = envpS + NULL; pid_t pid = fork() || (envpS.empty() ? execv(argv[0], &argv[0]) : execve(argv[0], &argv[0], &envp[0]); int status; waitpid(pid, &status, 0); return status;`
 * @pre @code (-1 != access(argv[0], X_OK)) @endcode */
const int execves(/* const std::string &pathname, -- `execve` requires `&pathname == &argv[0]` */ const std::vector<const std::string> &argvS = {}, const std::vector<const std::string> &envpS = {});
static const int execvex(const std::string &toSh) {return execves({"/bin/sh", "-c", toSh});}
typedef class Cns {
public:
	virtual ~Cns() = default;
	virtual const bool hasImplementation() const {return typeid(Cns) != typeid(*this);} /* `typeid(this)` is the pointer's type, which never equals `typeid(Cns)` */
	virtual const bool isInitialized() const {return initialized;}
	virtual void setInitialized(const bool is) {initialized = is;}
	virtual void setInputMode(CnsMode x) {inputMode = x;}
	virtual void setOutputMode(CnsMode x) {outputMode = x;}
	virtual void setInputNeurons(size_t x) {inputNeurons = x;}
	virtual void setOutputNeurons(size_t x) {outputNeurons = x;}
	virtual void setLayersOfNeurons(size_t x) {layersOfNeurons = x;}
	virtual void setNeuronsPerLayer(size_t x) {neuronsPerLayer = x;}
	/* @throw bad_alloc
	 * @pre @code hasImplementation() @endcode
	 * @post @code isInitialized() @endcode */
	// template<Input, Output> virtual void setupSynapses(std::vector<std::tuple<Input, Output>> inputsToOutputs); /* C++ does not support templates of virtual functions ( https://stackoverflow.com/a/78440416/24473928 ) */
	/* @pre @code isInitialized() @endcode */
	// template<Input, Output> virtual const Output process(Input input);
#define templateWorkaround(INPUT_MODE, INPUT_TYPEDEF) \
	virtual void setupSynapses(const std::vector<const std::tuple<INPUT_TYPEDEF, const bool>> &inputsToOutputs) {inputMode = (INPUT_MODE); outputMode = cnsModeBool;}\
	virtual void setupSynapses(const std::vector<const std::tuple<INPUT_TYPEDEF, const char>> &inputsToOutputs) {inputMode = (INPUT_MODE); outputMode = cnsModeChar;}\
	virtual void setupSynapses(const std::vector<const std::tuple<INPUT_TYPEDEF, const int>> &inputsToOutputs) {inputMode = (INPUT_MODE); outputMode = cnsModeInt;}\
	virtual void setupSynapses(const std::vector<const std::tuple<INPUT_TYPEDEF, const unsigned int>> &inputsToOutputs) {inputMode = (INPUT_MODE); outputMode = cnsModeUint;}\
	virtual void setupSynapses(const std::vector<const std::tuple<INPUT_TYPEDEF, float>> &inputsToOutputs) {inputMode = (INPUT_MODE); outputMode = cnsModeFloat;}\
	virtual void setupSynapses(const std::vector<const std::tuple<INPUT_TYPEDEF, const double>> &inputsToOutputs) {inputMode = (INPUT_MODE); outputMode = cnsModeDouble;}\
	virtual void setupSynapses(const std::vector<const std::tuple<INPUT_TYPEDEF, const std::vector<bool>>> &inputsToOutputs) {inputMode = (INPUT_MODE); outputMode = cnsModeVectorBool;}\
	virtual void setupSynapses(const std::vector<const std::tuple<INPUT_TYPEDEF, const std::vector<char>>> &inputsToOutputs) {inputMode = (INPUT_MODE); outputMode = cnsModeVectorChar;}\
	virtual void setupSynapses(const std::vector<const std::tuple<INPUT_TYPEDEF, const std::vector<int>>> &inputsToOutputs) {inputMode = (INPUT_MODE); outputMode = cnsModeVectorInt;}\
	virtual void setupSynapses(const std::vector<const std::tuple<INPUT_TYPEDEF, const std::vector<unsigned int>>> &inputsToOutputs) {inputMode = (INPUT_MODE); outputMode = cnsModeVectorUint;}\
	virtual void setupSynapses(const std::vector<const std::tuple<INPUT_TYPEDEF, const std::vector<float>>> &inputsToOutputs) {inputMode = (INPUT_MODE); outputMode = cnsModeVectorFloat;}\
	virtual void setupSynapses(const std::vector<const std::tuple<INPUT_TYPEDEF, const std::vector<double>>> &inputsToOutputs) {inputMode = (INPUT_MODE); outputMode = cnsModeVectorDouble;}\
	virtual void setupSynapses(const std::vector<const std::tuple<INPUT_TYPEDEF, const std::string>> &inputsToOutputs) {inputMode = (INPUT_MODE); outputMode = cnsModeString;}\
	virtual const bool processToBool(INPUT_TYPEDEF &input) const {assert((INPUT_MODE) == inputMode && cnsModeBool == outputMode); return 0;}\
	virtual const char processToChar(INPUT_TYPEDEF &input) const {assert((INPUT_MODE) == inputMode && cnsModeChar == outputMode); return 0;}\
	virtual const int processToInt(INPUT_TYPEDEF &input) const {assert((INPUT_MODE) == inputMode && cnsModeInt == outputMode); return 0;}\
	virtual const unsigned int processToUint(INPUT_TYPEDEF &input) const {assert((INPUT_MODE) == inputMode && cnsModeUint == outputMode); return 0;}\
	virtual const float processToFloat(INPUT_TYPEDEF &input) const {assert((INPUT_MODE) == inputMode && cnsModeFloat == outputMode); return 0;}\
	virtual const double processToDouble(INPUT_TYPEDEF &input) const {assert((INPUT_MODE) == inputMode && cnsModeDouble == outputMode); return 0;}\
	virtual const std::vector<bool> processToVectorBool(INPUT_TYPEDEF &input) const {assert((INPUT_MODE) == inputMode && cnsModeVectorBool == outputMode); return {};}\
	virtual const std::vector<char> processToVectorChar(INPUT_TYPEDEF &input) const {assert((INPUT_MODE) == inputMode && cnsModeVectorChar == outputMode); return {};}\
	virtual const std::vector<int> processToVectorInt(INPUT_TYPEDEF &input) const {assert((INPUT_MODE) == inputMode && cnsModeVectorInt == outputMode); return {};}\
	virtual const std::vector<unsigned int> processToVectorUint(INPUT_TYPEDEF &input) const {assert((INPUT_MODE) == inputMode && cnsModeVectorUint == outputMode); return {};}\
	virtual std::vector<float> processToVectorFloat(INPUT_TYPEDEF &input) const {assert((INPUT_MODE) == inputMode && cnsModeVectorFloat == outputMode); return {};}\
	virtual const std::vector<double> processToVectorDouble(INPUT_TYPEDEF &input) const {assert((INPUT_MODE) == inputMode && cnsModeVectorDouble == outputMode); return {};}\
	virtual const std::string processToString(INPUT_TYPEDEF &input) const {auto val = processToVectorChar(input); return std::string(val.cbegin(), val.cend()); /* `&val[0]` is undefined if `val.empty()` */}
	templateWorkaround(cnsModeBool, const bool)
	templateWorkaround(cnsModeChar, const char)
	templateWorkaround(cnsModeInt, const int)
	templateWorkaround(cnsModeUint, const unsigned int)
	templateWorkaround(cnsModeFloat, const float)
	templateWorkaround(cnsModeDouble, const double)
	templateWorkaround(cnsModeVectorBool, const std::vector<bool>)
	templateWorkaround(cnsModeVectorChar, const std::vector<char>)
	templateWorkaround(cnsModeVectorInt, const std::vector<int>)
	templateWorkaround(cnsModeVectorUint, const std::vector<unsigned int>)
	templateWorkaround(cnsModeVectorFloat, const std::vector<float>)
	templateWorkaround(cnsModeVectorDouble, const std::vector<double>)
	templateWorkaround(cnsModeString, const std::string)
private:
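	bool initialized = false; /* without initializers, `isInitialized()` reads indeterminate values */
	CnsMode inputMode = cnsModeBool, outputMode = cnsModeBool;
	size_t inputNeurons = 0, outputNeurons = 0, layersOfNeurons = 0, neuronsPerLayer = 0;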
} Cns;
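
The `templateWorkaround` macro exists because member templates can not be `virtual`. The pattern in miniature (one virtual overload per input type, plus a non-virtual template which forwards to them) can be sketched as follows. `Processor` and `LengthProcessor` are hypothetical classes for illustration, not part of `ClassCns.hxx`.

```cpp
#include <cassert>
#include <string>

/* Hypothetical miniature of the pattern; not the `Cns` class. */
class Processor {
public:
	virtual ~Processor() = default;
	/* One virtual overload per supported input type (what the macro produces). */
	virtual float processToFloat(const std::string &) const { return 0.0f; }
	virtual float processToFloat(const int &) const { return 0.0f; }
	/* Non-virtual template: overload resolution picks which virtual function to call. */
	template<class Input>
	float process(const Input &input) const { return processToFloat(input); }
};

class LengthProcessor : public Processor {
public:
	using Processor::processToFloat; /* keeps the overloads which are not overridden visible */
	float processToFloat(const std::string &input) const override {
		return static_cast<float>(input.size());
	}
};
```

A derived class overrides just the overloads it supports; the others fall through to the base stubs, which is also how `Cns` subclasses would be expected to behave.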

less cxx/ClassCns.cxx

const int execves(const std::vector<const std::string> &argvS, const std::vector<const std::string> &envpS) {
#ifdef _POSIX_VERSION
	pid_t pid = fork();
	if(0 != pid) {
		int status;
		assert(-1 != pid);
		waitpid(pid, &status, 0);
		return status;
	} /* if 0, is fork */
	const std::vector<std::string> argvSmutable = {argvS.cbegin(), argvS.cend()};
	std::vector<char *> argv;
	//for(auto x : argvSmutable) { /* with `fsanitize=address` this triggers "stack-use-after-scope" */
	for(auto x = argvSmutable.begin(); argvSmutable.end() != x; ++x) {
		argv.push_back(const_cast<char *>(x->c_str()));
	}
	argv.push_back(NULL);
	if(envpS.empty()) {
		/* Reuse LD_PRELOAD to fix https://github.com/termux-play-store/termux-issues/issues/24 */
		execv(argv[0], &argv[0]); /* NORETURN */
	} else {
		const std::vector<std::string> envpSmutable = {envpS.cbegin(), envpS.cend()};
		std::vector<char *> envp;
		for(auto x = envpSmutable.begin(); envpSmutable.end() != x; ++x) {
			envp.push_back(const_cast<char *>(x->c_str()));
		}
		envp.push_back(NULL);
		execve(argv[0], &argv[0], &envp[0]); /* NORETURN */
	}
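	_exit(EXIT_FAILURE); /* reached just if execv*() failed; `_exit()` does not flush stdio buffers cloned from the parent */
#else /* else !def _POSIX_VERSION */
	return -1; /* no `fork()`/`execv()`; without this `return`, the function's result is undefined */
#endif /* def _POSIX_VERSION else */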
}
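
`execves()` returns the raw status from `waitpid()`, which equals the exit code only when both are 0. To compare against specific exit codes, the status must be decoded with `WIFEXITED`/`WEXITSTATUS`. A minimal POSIX-only sketch follows; `runAndGetExitCode` is a hypothetical helper, and the value 127 for a failed `execv()` follows the shell convention (an assumption, not the repository's choice).

```cpp
#include <cassert>
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (POSIX only): runs `args[0]` with `args`; returns its exit code,
 * or -1 if `fork()` failed or the child did not exit normally (such as through a signal). */
int runAndGetExitCode(const std::vector<std::string> &args) {
	assert(!args.empty());
	std::vector<char *> argv; /* built before `fork()`, so the child does not allocate */
	for(const auto &arg : args) {
		argv.push_back(const_cast<char *>(arg.c_str()));
	}
	argv.push_back(nullptr);
	const pid_t pid = fork();
	if(-1 == pid) {
		return -1;
	}
	if(0 == pid) { /* child */
		execv(argv[0], argv.data());
		_exit(127); /* reached only if `execv()` failed */
	}
	int status = 0;
	if(-1 == waitpid(pid, &status, 0) || !WIFEXITED(status)) {
		return -1;
	}
	return WEXITSTATUS(status);
}
```

For example, `runAndGetExitCode({"/bin/sh", "-c", "exit 3"})` gives 3, whereas the raw status for the same child is a different (encoded) value.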

less cxx/VirusAnalysis.hxx

typedef enum VirusAnalysisResult : char {
	virusAnalysisAbort = static_cast<char>(false), /* do not launch */
	virusAnalysisPass = static_cast<char>(true), /* launch this (file passes) */
	virusAnalysisRequiresReview, /* submit to hosts to do analysis (infection is difficult to prove, other than known signatures) */
	virusAnalysisContinue /* continue to next tests (is normal; most analyses can not prove a file passes) */
} VirusAnalysisResult; /* if(virusAnalysisAbort != VirusAnalysisResult) {static_assert(true == static_cast<bool>(VirusAnalysisResult));} */

static ResultList passList, abortList; /* hosts produce, clients initialize shared clones of this from disk */
static Cns analysisCns, virusFixCns; /* hosts produce, clients initialize shared clones of this from disk */

/* `return (produceAbortListSignatures(EXAMPLES) && produceAnalysisCns(EXAMPLES) && produceVirusFixCns(EXAMPLES));`
 * @pre @code analysisCns.hasImplementation() && virusFixCns.hasImplementation() @endcode */
const bool virusAnalysisTestsThrows();
static const bool virusAnalysisTests() {try {return virusAnalysisTestsThrows();} catch(...) {return false;}}

const VirusAnalysisResult hashAnalysis(const PortableExecutable &file, const ResultListHash &fileHash); /* `if(abortList[file]) {return Abort;} if(passList[file]) {return Pass;} return Continue;` */

/* To produce virus signatures:
 * use passlists (of files reviewed which pass),
 * plus abortlists (of files which failed), such lists as Virustotal has.
 * `produceAbortListSignatures()` is to produce the `abortList.signatures` list, with the smallest substrings unique to infected files; is slow, requires huge database of executables; just hosts should produce this.
 * For clients: Comodo has lists of virus signatures to check against at https://www.comodo.com/home/internet-security/updates/vdp/database.php
 * @throw std::bad_alloc
 * @pre @code passList.bytecodes.size() && abortList.bytecodes.size() && !listsIntersect(passList.bytecodes, abortList.bytecodes) @endcode
 * @post @code abortList.signatures.size() @endcode */
void produceAbortListSignatures(const ResultList &passList, ResultList &abortList);
/* `if(intersection(file.bytecode, abortList.signatures)) {return VirusAnalysisAbort;} return VirusAnalysisContinue;`
 * @pre @code abortList.signatures.size() @endcode */
const VirusAnalysisResult signatureAnalysis(const PortableExecutable &file, const ResultListHash &fileHash);

/* Static analysis */
/* @throw bad_alloc */
const std::vector<std::string> importedFunctionsList(const PortableExecutable &file);
static std::vector<std::string> syscallPotentialDangers = {
	"memopen", "fwrite", "socket", "GetProcAddress", "IsVmPresent"
};
const VirusAnalysisResult staticAnalysis(const PortableExecutable &file, const ResultListHash &fileHash); /* `if(intersection(importedFunctionsList(file), syscallPotentialDangers)) {return RequiresReview;} return Continue;` */

/* Analysis sandbox */
const VirusAnalysisResult sandboxAnalysis(const PortableExecutable &file, const ResultListHash &fileHash); /* `chroot(strace(file)) >> outputs; return straceOutputsAnalysis(outputs);` */
static std::vector<std::string> stracePotentialDangers = {"write(*)"};
const VirusAnalysisResult straceOutputsAnalysis(const FilePath &straceOutput); /* TODO: regex */

/* Analysis CNS */
/* To train (setup synapses) the CNS, is slow plus requires access to huge file databases,
but the synapses use small resources (allow clients to do fast analysis.)
 * @pre @code cns.hasImplementation() && pass.bytecodes.size() && abort.bytecodes.size() @endcode
 * @post @code cns.isInitialized() @endcode */
void produceAnalysisCns(const ResultList &pass, const ResultList &abort,
	const ResultList &unreviewed = ResultList() /* WARNING! Possible danger to use unreviewed files */,
	Cns &cns = analysisCns
);
/* If bytecode resembles `abortList`, `return 0;`. If undecidable (resembles `unreviewedList`), `return 0.5;`. If resembles passList, `return 1;`
 * @pre @code cns.isInitialized() @endcode */
const float cnsAnalysisScore(const PortableExecutable &file, const Cns &cns = analysisCns);
/* `return (bool)round(cnsAnalysisScore(file, cns)) ? virusAnalysisContinue : virusAnalysisRequiresReview;`
 * @pre @code cns.isInitialized() @endcode */
const VirusAnalysisResult cnsAnalysis_(const PortableExecutable &file, const ResultListHash &fileHash, const Cns &cns = analysisCns);
const VirusAnalysisResult cnsAnalysis(const PortableExecutable &file, const ResultListHash &fileHash);

static std::map<ResultListHash, VirusAnalysisResult> hashAnalysisCaches, signatureAnalysisCaches, staticAnalysisCaches, cnsAnalysisCaches, sandboxAnalysisCaches; /* temporary caches; memoizes results */

typedef const VirusAnalysisResult (*VirusAnalysisFun)(const PortableExecutable &file, const ResultListHash &fileHash);
static std::vector<VirusAnalysisFun> virusAnalyses = {hashAnalysis, signatureAnalysis, staticAnalysis, cnsAnalysis, sandboxAnalysis /* sandbox is slow, so put last*/};
const VirusAnalysisResult virusAnalysis(const PortableExecutable &file); /* auto hash = Sha2(file.bytecode); for(VirusAnalysisFun analysis : virusAnalyses) {analysis(file, hash);} */
static const VirusAnalysisResult submitSampleToHosts(const PortableExecutable &file) {return virusAnalysisRequiresReview;} /* TODO: requires compatible hosts to upload to */

/* Setup virusFix CNS, uses more resources than `produceAnalysisCns()` */
/* `abortOrNull` should map to `passOrNull` (`ResultList` is composed of `std::tuple`s, because just `produceVirusFixCns()` requires this),
 * with `abortOrNull->bytecodes[x] = NULL` (or "\0") for new SW synthesis,
 * and `passOrNull->bytecodes[x] = NULL` (or "\0") if infected and CNS can not cleanse this.
 * @pre @code cns.hasImplementation() @endcode
 * @post @code cns.isInitialized() @endcode
 */
void produceVirusFixCns(
	const ResultList &passOrNull, /* Expects `resultList->bytecodes[x] = NULL` if does not pass */
	const ResultList &abortOrNull, /* Expects `resultList->bytecodes[x] = NULL` if does pass */
	Cns &cns = virusFixCns
);

/* Uses more resources than `cnsAnalysis()`, can undo infection from bytecodes (restore to fresh SW)
 * @pre @code cns.isInitialized() @endcode */
const std::string cnsVirusFix(const PortableExecutable &file, const Cns &cns = virusFixCns);

less cxx/VirusAnalysis.cxx

const bool virusAnalysisTestsThrows() {
	const ResultList abortOrNull {
		.bytecodes {  /* Produce from an antivirus vendor's (such as VirusTotal.com's) infection databases */
			"infection",
			"infectedSW",
			"corruptedSW",
			""
		}
	};
	const ResultList passOrNull {
		.bytecodes {  /* Produce from an antivirus vendor's (such as VirusTotal.com's) fresh-files databases */
			"",
			"SW",
			"SW",
			"newSW"
		}
	};
	produceAbortListSignatures(passList, abortList);
	produceAnalysisCns(passOrNull, abortOrNull, ResultList(), analysisCns);
	produceVirusFixCns(passOrNull, abortOrNull, virusFixCns);
	/* callbackHook("exec", */ [](const PortableExecutable &file) { /* TODO: OS-specific "hook"/"callback" for `exec()`/app-launches */
		switch(virusAnalysis(file)) {
		case virusAnalysisPass:
			return true; /* launch this */
		case virusAnalysisRequiresReview:
			submitSampleToHosts(file); /* manual review */
			return false;
		default:
			return false; /* abort */
		}
	} /* ) */ ;
	return true;
}
const VirusAnalysisResult virusAnalysis(const PortableExecutable &file) {
	const auto fileHash = Sha2(file.bytecode);
	for(const auto &analysis : virusAnalyses) {
		switch(analysis(file, fileHash)) {
			case virusAnalysisPass:
				return virusAnalysisPass;
			case virusAnalysisRequiresReview:
				/*submitSampleToHosts(file);*/ /* TODO:? up to caller to do this? */
				return virusAnalysisRequiresReview;
			case virusAnalysisAbort:
				return virusAnalysisAbort;
			default: /* virusAnalysisContinue */
				break; /* a label must have a statement after it; continues to the next analysis */
		}
	}
	return virusAnalysisPass;
}
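
The control flow of `virusAnalysis()` (run the analyses in order of cost, stop at the first decisive result) can be sketched in isolation as follows. `Verdict`, `Analysis`, and `runAnalyses` are hypothetical names; the fallback for "all analyses continue" is a parameter here, so as to show that it is a policy choice.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

enum class Verdict { Abort, Pass, RequiresReview, Continue };
using Analysis = std::function<Verdict(const std::string &bytecode)>;

/* Runs `analyses` in order; the first verdict other than `Continue` ends the chain.
 * `ifAllContinue` is what to return if no analysis is decisive. */
Verdict runAnalyses(const std::vector<Analysis> &analyses, const std::string &bytecode, Verdict ifAllContinue) {
	for(const auto &analysis : analyses) {
		assert(analysis); /* each entry must hold a callable */
		const Verdict verdict = analysis(bytecode);
		if(Verdict::Continue != verdict) {
			return verdict;
		}
	}
	return ifAllContinue;
}
```

The function above uses `virusAnalysisPass` as the fallback; hosts which prefer manual review of unknown files could use `RequiresReview` instead.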

const VirusAnalysisResult hashAnalysis(const PortableExecutable &file, const ResultListHash &fileHash) {
	const auto cached = hashAnalysisCaches.find(fileHash); /* `find()` avoids the cost of exceptions on cache misses */
	if(hashAnalysisCaches.cend() != cached) {
		return cached->second;
	}
	if(passList.hashes.count(fileHash)) { /* `unordered_set::count()` is O(1) on average; `std::find()` is O(n) */
		return hashAnalysisCaches[fileHash] = virusAnalysisPass;
	} else if(abortList.hashes.count(fileHash)) {
		return hashAnalysisCaches[fileHash] = virusAnalysisAbort;
	}
	return hashAnalysisCaches[fileHash] = virusAnalysisContinue; /* continue to next tests */
}

const VirusAnalysisResult signatureAnalysis(const PortableExecutable &file, const ResultListHash &fileHash) {
	try {
		const auto result = signatureAnalysisCaches.at(fileHash);
		return result;
	} catch (...) {
		if(listOfSubstrHasMatch(abortList.signatures, file.bytecode)) {
			return signatureAnalysisCaches[fileHash] = virusAnalysisAbort;
		}
		return signatureAnalysisCaches[fileHash] = virusAnalysisContinue;
	}
}

void produceAbortListSignatures(const ResultList &passList, ResultList &abortList) {
	abortList.signatures.reserve(abortList.bytecodes.size());
	for(const auto &file : abortList.bytecodes) {
		auto tuple = listProduceUniqueSubstr(passList.bytecodes, file);
		abortList.signatures.push_back(ResultListSignature(std::get<0>(tuple), std::get<1>(tuple)));
	} /* The most simple signature is a substring, but some analyses use regexes. */
}

const std::vector<std::string> importedFunctionsList(const PortableExecutable &file) {
/* TODO
 * Resources; “Portable Executable” for Windows ( https://learn.microsoft.com/en-us/windows/win32/debug/pe-format https://wikipedia.org/wiki/Portable_Executable ),
 * “Executable and Linkable Format” for most others such as UNIX/Linuxes ( https://wikipedia.org/wiki/Executable_and_Linkable_Format ),
 * shows how to analyse lists of libraries(.DLL's/.SO's) the SW uses,
 * plus what functions (new syscalls) the SW can goto through `jmp`/`call` instructions.
 *
 * "x86" instruction list for Intel/AMD ( https://wikipedia.org/wiki/x86 ),
 * "aarch64" instruction list for most smartphones/tablets ( https://wikipedia.org/wiki/aarch64 ),
 * shows how to analyse what OS functions the SW goes to without libraries (through `int`/`syscall`, old;  most new SW uses `jmp`/`call`.)
 * Plus, instructions lists show how to analyse what args the apps/SW pass to functions/syscalls (simple for constant args such as "push 0x2; call functions;",
 * but if registers/addresses as args such as "push eax; push [address]; call [address2];" must guess what is *"eax"/"[address]"/"[address2]", or use sandboxes.
 *
 * https://www.codeproject.com/Questions/338807/How-to-get-list-of-all-imported-functions-invoked shows how to analyse dynamic loads of functions (if do this, `syscallPotentialDangers[]` does not include `GetProcAddress()`.)
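 */
	return {}; /* placeholder (lists no imports) until the TODO is done; to omit `return` is undefined behavior */
}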

const VirusAnalysisResult staticAnalysis(const PortableExecutable &file, const ResultListHash &fileHash) {
	try {
		const auto result = staticAnalysisCaches.at(fileHash);
		return result;
	} catch (...) {
		auto syscallsUsed = importedFunctionsList(file);
		std::sort(syscallPotentialDangers.begin(), syscallPotentialDangers.end());
		std::sort(syscallsUsed.begin(), syscallsUsed.end());
		if(listsIntersect(syscallPotentialDangers, syscallsUsed)) {
			return staticAnalysisCaches[fileHash] = virusAnalysisRequiresReview;
		}
		return staticAnalysisCaches[fileHash] = virusAnalysisContinue;
	}
}

const VirusAnalysisResult sandboxAnalysis(const PortableExecutable &file, const ResultListHash &fileHash) {
	try {
		const auto result = sandboxAnalysisCaches.at(fileHash);
		return result;
	} catch (...) {
		execvex("cp -r '/usr/home/sandbox/' '/usr/home/sandbox.bak'"); /* or produce FS snapshot */
		execvex("cp '" + file.path + "' '/usr/home/sandbox/'");
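		execvex("chroot '/usr/home/sandbox/' strace -o '/strace.outputs' \"/$(basename '" + file.path + "')\""); /* `strace -o` stores the trace in the sandbox; `file.path` is not escaped, so must not have quotes */
		execvex("mv '/usr/home/sandbox/strace.outputs' '/tmp/strace.outputs'");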
		execvex("rm -r '/usr/home/sandbox/' && mv '/usr/home/sandbox.bak' '/usr/home/sandbox/'"); /* or restore FS snapshot */
		return sandboxAnalysisCaches[fileHash] = straceOutputsAnalysis("/tmp/strace.outputs");
	}
}
const VirusAnalysisResult straceOutputsAnalysis(const FilePath &straceOutput) {
	auto straceDump = std::ifstream(straceOutput);
	std::vector<std::string> straceOutputs /*= explodeToList(straceDump, "\n")*/;
	for(std::string line; std::getline(straceDump, line); ) { /* `line`, so as not to shadow the `straceOutput` path */
		straceOutputs.push_back(line);
	}
	std::sort(stracePotentialDangers.begin(), stracePotentialDangers.end());
	std::sort(straceOutputs.begin(), straceOutputs.end());
	if(listsIntersect(stracePotentialDangers, straceOutputs)) { /* Todo: regex */
		return virusAnalysisRequiresReview;
	}
	return virusAnalysisContinue;
}
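
The "Todo: regex" above could go in this direction: exact comparison of whole lines can not match `"write(*)"` against real `strace` output (which has arguments and return values), so each line is tested against patterns. This is a sketch with `std::regex`; `dangerPatterns` and `straceLineIsDanger` are hypothetical names, and the syscalls chosen are examples, not a reviewed list.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

/* Hypothetical patterns; a production list requires review. */
static const std::vector<std::regex> dangerPatterns = {
	std::regex("^(unlink|unlinkat|rename)\\("), /* removes or replaces files */
	std::regex("^connect\\(")                    /* opens network connections */
};

/* Returns true if the strace output line starts with a syscall from `dangerPatterns`. */
bool straceLineIsDanger(const std::string &line) {
	assert(!dangerPatterns.empty());
	for(const auto &pattern : dangerPatterns) {
		if(std::regex_search(line, pattern)) {
			return true;
		}
	}
	return false;
}
```

The `^` anchors assume the default `strace` format (syscall name at the start of each line); options which add prefixes, such as process IDs or timestamps, would require a different anchor.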

void produceAnalysisCns(const ResultList &pass, const ResultList &abort,
const ResultList &unreviewed /* = ResultList(), WARNING! Possible danger to use unreviewed files */,
Cns &cns /* = analysisCns */
) {
	std::vector<const std::tuple<const FileBytecode, float>> inputsToOutputs;
	const size_t maxPassSize = listMaxSize(pass.bytecodes);
	const size_t maxAbortSize = listMaxSize(abort.bytecodes);
	cns.setInputMode(cnsModeString);
	cns.setOutputMode(cnsModeFloat);
	cns.setInputNeurons(maxPassSize > maxAbortSize ? maxPassSize : maxAbortSize);
	cns.setOutputNeurons(1);
	cns.setLayersOfNeurons(6666);
	cns.setNeuronsPerLayer(26666);
	inputsToOutputs.reserve(pass.bytecodes.size());
	for(const auto &bytecodes : pass.bytecodes) {
		inputsToOutputs.push_back({bytecodes, 1.0});
	}
	cns.setupSynapses(inputsToOutputs);
	inputsToOutputs.clear();
	if(!unreviewed.bytecodes.empty()) { /* WARNING! Possible danger to use unreviewed files */
		inputsToOutputs.reserve(unreviewed.bytecodes.size());
		for(const auto &bytecodes : unreviewed.bytecodes) {
			inputsToOutputs.push_back({bytecodes, 0.5f}); /* `1 / 2` is integer division (== 0), which would mark unreviewed files as infected */
		}
		cns.setupSynapses(inputsToOutputs);
		inputsToOutputs.clear();
	}
	inputsToOutputs.reserve(abort.bytecodes.size());
	for(const auto &bytecodes : abort.bytecodes) {
		inputsToOutputs.push_back({bytecodes, 0.0});
	}
	cns.setupSynapses(inputsToOutputs);
	inputsToOutputs.clear();
}
const float cnsAnalysisScore(const PortableExecutable &file, const Cns &cns /* = analysisCns */) {
	return cns.processToFloat(file.bytecode);
}
const VirusAnalysisResult cnsAnalysis_(const PortableExecutable &file, const ResultListHash &fileHash, const Cns &cns /* = analysisCns */) {
	try {
		const auto result = cnsAnalysisCaches.at(fileHash);
		return result;
	} catch (...) {
		return cnsAnalysisCaches[fileHash] = static_cast<bool>(round(cnsAnalysisScore(file, cns))) ? virusAnalysisContinue : virusAnalysisRequiresReview;
	}
}
const VirusAnalysisResult cnsAnalysis(const PortableExecutable &file, const ResultListHash &fileHash) {
	return cnsAnalysis_(file, fileHash);
}
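
How `cnsAnalysis_()` maps scores to results can be shown in isolation. In this sketch, `Verdict` and `scoreToVerdict` are hypothetical names; the thresholds follow from `round()`, which rounds halfway cases away from zero.

```cpp
#include <cassert>
#include <cmath>

enum class Verdict { RequiresReview, Continue };

static_assert(0 == 1 / 2, "integer division truncates; use 0.5f for the undecidable score");

/* Scores: 0 resembles the abort list, 0.5 is undecidable, 1 resembles the pass list. */
Verdict scoreToVerdict(float score) {
	assert(0.0f <= score && score <= 1.0f);
	return (0.0f != std::round(score)) ? Verdict::Continue : Verdict::RequiresReview;
}
```

Note that an undecidable score of exactly 0.5 rounds to 1, so it continues to the next analysis (the sandbox); just scores below 0.5 require review.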

void produceVirusFixCns(const ResultList &passOrNull, const ResultList &abortOrNull, Cns &cns /* = virusFixCns */) {
	std::vector<const std::tuple<const FileBytecode, const FileBytecode>> inputsToOutputs;
	cns.setInputMode(cnsModeString);
	cns.setOutputMode(cnsModeString);
	cns.setInputNeurons(listMaxSize(abortOrNull.bytecodes)); /* inputs are the infected bytecodes */
	cns.setOutputNeurons(listMaxSize(passOrNull.bytecodes)); /* outputs are the restored bytecodes */
	cns.setLayersOfNeurons(6666);
	cns.setNeuronsPerLayer(26666);
	assert(passOrNull.bytecodes.size() == abortOrNull.bytecodes.size());
	inputsToOutputs.reserve(passOrNull.bytecodes.size());
	for(size_t x = 0; passOrNull.bytecodes.size() > x; ++x) {
		inputsToOutputs.push_back({abortOrNull.bytecodes[x], passOrNull.bytecodes[x]});
	}
	cns.setupSynapses(inputsToOutputs);
}

const FileBytecode cnsVirusFix(const PortableExecutable &file, const Cns &cns /* = virusFixCns */) {
	return cns.processToString(file.bytecode);
}
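
For comparison, here is a minimal sketch of the cache step which `cnsAnalysis_()` does, with `find()` in place of `try`/`catch`. `Verdict`, `VerdictCache`, `verdictFromScore` and `cachedVerdict` are hypothetical names for this demo (not from the repo), and the `scorer` callback stands in for `cnsAnalysisScore()`.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <string>
#include <unordered_map>

enum class Verdict { Continue, RequiresReview };

/* Hypothetical stand-in for `cnsAnalysisCaches`: file hash -> verdict. */
using VerdictCache = std::unordered_map<std::string, Verdict>;

/* Scores at or above 0.5 round to 1 (continue); lower scores round to 0 (review). */
inline Verdict verdictFromScore(float score) {
	assert(0.0f <= score && score <= 1.0f);
	return (std::round(score) >= 1.0f) ? Verdict::Continue : Verdict::RequiresReview;
}

/* Looks up `fileHash`; calls `scorer` only on a cache miss, then stores the verdict. */
inline Verdict cachedVerdict(VerdictCache &cache, const std::string &fileHash, const std::function<float()> &scorer) {
	const auto found = cache.find(fileHash);
	if(cache.end() != found) {
		return found->second; /* cache hit: the (slow) scorer does not run */
	}
	const Verdict verdict = verdictFromScore(scorer());
	cache.emplace(fileHash, verdict);
	return verdict;
}
```

The second lookup of the same hash returns the stored verdict, so the slow score runs once per new file.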

less cxx/main.cxx

#include "ClassCns.hxx" /* execves execvex */
#include "AssistantCns.hxx" /* assistantCnsTestsThrows */
#include "Macros.hxx" /* ASSUME EXPECTS ENSURES NOEXCEPT NORETURN */
#include "VirusAnalysis.hxx" /* virusAnalysisTestsThrows */
#include <cstdlib> /* exit EXIT_SUCCESS */
#include <iostream> /* cout flush endl */
namespace Susuwu {
void noExcept() NOEXCEPT;
NORETURN void noReturn();
void noExcept() NOEXCEPT {std::cout << std::flush;}
void noReturn() {exit(EXIT_SUCCESS);}
int testHarnesses() EXPECTS(true) ENSURES(true) {
	std::cout << "cxx/Macros.hxx: " << std::flush;
	ASSUME(true);
	noExcept();
	std::cout << "pass" << std::endl;
	std::cout << "execves(): " << std::flush;
	(EXIT_SUCCESS == execves({"/bin/echo", "pass"})) || std::cout << "error" << std::endl;
	std::cout << "execvex(): " << std::flush;
	(EXIT_SUCCESS == execvex("/bin/echo pass")) || std::cout << "error" << std::endl;
	std::cout << "virusAnalysisTestsThrows(): " << std::flush;
	if(virusAnalysisTestsThrows()) {
		std::cout << "pass" << std::endl;
	} else {
		std::cout << "error" << std::endl;
	}
	std::cout << "assistantCnsTestsThrows(): " << std::flush;
	if(assistantCnsTestsThrows()) {
		std::cout << "pass" << std::endl;
	} else {
		std::cout << "error" << std::endl;
	}
	noReturn();
}
}; /* namespace Susuwu */
int main(int argc, const char **args) {
	return Susuwu::testHarnesses();
}

To run most of this fast (with less lag), use CXXFLAGS which auto-vectorize/auto-parallelize; to set up CNS synapses (Cns::setupSynapses()) fast, use TensorFlow's MapReduce. Resources: How to have computers process fast.
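
A minimal sketch of a loop which compilers can usually auto-vectorize (for example with `g++ -O3 -march=native`; those flags are an assumption, the post does not say which CXXFLAGS it means). `weighInputs` is a hypothetical helper, not part of `class Cns`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

/* Multiplies each input with its weight. Contiguous arrays, a counted loop,
 * and no dependence between iterations is the shape which auto-vectorizes. */
inline std::vector<float> weighInputs(const std::vector<float> &inputs, const std::vector<float> &weights) {
	assert(inputs.size() == weights.size());
	std::vector<float> weighted(inputs.size());
	const float *in = inputs.data();
	const float *w = weights.data();
	float *out = weighted.data();
	const std::size_t count = inputs.size();
	for(std::size_t x = 0; count > x; ++x) {
		out[x] = in[x] * w[x]; /* each iteration is independent of the others */
	}
	return weighted;
}
```

A sum across the products (a reduction) is less simple to vectorize for floats, since this alters the order of the additions.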

For comparison; produceVirusFixCns is close to assistants (such as "ChatGPT 4.0" or "Claude-3 Opus",) have such demo as produceAssistantCns;
less cxx/AssistantCns.hxx

static Cns assistantCns;

/* if (with example inputs) these functions (`questionsResponsesFromHosts()` `produceAssistantCns()`) pass, `return true;`
 * @throw std::bad_alloc
 * @throw std::logic_error
 * @pre @code assistantCns.hasImplementation() @endcode */
const bool assistantCnsTestsThrows();
static const bool assistantCnsTests() { try{ return assistantCnsTestsThrows(); } catch(...) { return false; }}
static std::vector<FilePath> DefaultHosts = {
/* Uniform Resource Locators of hosts which `questionsResponsesFromHosts()` uses
 * Wikipedia is a special case; has compressed downloads of databases ( https://wikipedia.org/wiki/Wikipedia:Database_download )
 * Github is a special case; has compressed downloads of repositories ( https://docs.github.com/en/get-started/start-your-journey/downloading-files-from-github )
 */
	"https://stackoverflow.com",
	"https://superuser.com",
	"https://quora.com"
};

/* @throw std::bad_alloc
 * @post If no question, `0 == questionsOrNull.bytecodes[x].size()` (new synthesis).
 * If no responses, `0 == responsesOrNull.bytecodes[x].size()` (ignore).
 * `questionsOrNull.signatures[x] = Uniform Resource Locator`
 * @code Sha2(ResultList.bytecodes[x]) == ResultList.hashes[x] @endcode */
void questionsResponsesFromHosts(ResultList &questionsOrNull, ResultList &responsesOrNull, const std::vector<FilePath> &hosts = DefaultHosts);
void questionsResponsesFromXhtml(ResultList &questionsOrNull, ResultList &responsesOrNull, const FilePath &filepath = "index.xhtml");
const std::vector<FilePath> ParseUrls(const FilePath &filepath = "index.xhtml"); /* TODO: for XML/XHTML could just use [ https://www.boost.io/libraries/regex/ https://github.com/boostorg/regex ] or [ https://www.boost.org/doc/libs/1_85_0/doc/html/property_tree/parsers.html#property_tree.parsers.xml_parser https://github.com/boostorg/property_tree/blob/develop/doc/xml_parser.qbk ] */
const FileBytecode ParseQuestion(const FilePath &filepath = "index.xhtml"); /* TODO: regex or XML parser */
const std::vector<FileBytecode> ParseResponses(const FilePath &filepath = "index.xhtml"); /* TODO: regex or XML parser */

/* @pre `questionsOrNull` maps to `responsesOrNull`,
 * `0 == questionsOrNull.bytecodes[x].size()` for new  synthesis (empty question has responses),
 * `0 == responsesOrNull.bytecodes[x].size()` if should not respond (question does not have answers).
 * @post Can use `assistantCnsProcess(cns, text)` @code cns.isInitialized() @endcode */
void produceAssistantCns(const ResultList &questionsOrNull, const ResultList &responsesOrNull, Cns &cns);

/* Clients use just these 2 functions */
/* `return cns.processToString(bytecode);`
 * @pre @code cns.isInitialized() @endcode */
const std::string assistantCnsProcess(const Cns &cns, const std::string &bytecode);
/* `while(std::cin >> questions) { std::cout << assistantCnsProcess(cns, questions); }` but more complex
 * @pre @code cns.isInitialized() @endcode */
void assistantCnsLoopProcess(const Cns &cns);

less cxx/AssistantCns.cxx

const bool assistantCnsTestsThrows() {
	ResultList questionsOrNull {
		.bytecodes { /* UTF-8 */
			ResultListBytecode("2^16"),
			ResultListBytecode("How to cause harm?"),
			ResultListBytecode("Do not respond."),
			ResultListBytecode("")
		}
	};
	ResultList responsesOrNull {
		.bytecodes { /* UTF-8 */
			ResultListBytecode("65536") + "<delimiterSeparatesMultiplePossibleResponses>" + "65,536", /* `+` is `concat()` for C++ */
			ResultListBytecode(""),
			ResultListBytecode(""),
			ResultListBytecode("How do you do?") + "<delimiterSeparatesMultiplePossibleResponses>" + "Fanuc produces autonomous robots"
		}
	};
	questionsResponsesFromHosts(questionsOrNull, responsesOrNull);
	produceAssistantCns(questionsOrNull, responsesOrNull, assistantCns);
	return true;
}
void produceAssistantCns(const ResultList &questionsOrNull, const ResultList &responsesOrNull, Cns &cns) {
	std::vector<const std::tuple<const ResultListBytecode, const ResultListBytecode>> inputsToOutputs;
	cns.setInputMode(cnsModeString);
	cns.setOutputMode(cnsModeString);
	cns.setInputNeurons(listMaxSize(questionsOrNull.bytecodes));
	cns.setOutputNeurons(listMaxSize(responsesOrNull.bytecodes));
	cns.setLayersOfNeurons(6666);
	cns.setNeuronsPerLayer(26666);
	assert(questionsOrNull.bytecodes.size() == responsesOrNull.bytecodes.size());
	inputsToOutputs.reserve(questionsOrNull.bytecodes.size());
	for(size_t x = 0; questionsOrNull.bytecodes.size() > x; ++x) {
		inputsToOutputs.push_back({questionsOrNull.bytecodes[x], responsesOrNull.bytecodes[x]});
	}
	cns.setupSynapses(inputsToOutputs);
}

void questionsResponsesFromHosts(ResultList &questionsOrNull, ResultList &responsesOrNull, const std::vector<FilePath> &hosts) {
	for(const auto &host : hosts) {
		execvex("wget '" + host + "/robots.txt' -Orobots.txt");
		execvex("wget '" + host + "' -Oindex.xhtml");
		questionsOrNull.signatures.push_back(host);
		questionsResponsesFromXhtml(questionsOrNull, responsesOrNull, "index.xhtml");
	}
}
void questionsResponsesFromXhtml(ResultList &questionsOrNull, ResultList &responsesOrNull, const FilePath &xhtmlFile) {
	auto noRobots = ParseUrls("robots.txt");
	auto question = ParseQuestion(xhtmlFile);
	if(question.size()) {
		const auto questionSha2 = Sha2(question);
		if(!listHasValue(questionsOrNull.hashes, questionSha2)) {
			questionsOrNull.hashes.insert(questionSha2);
			const auto responses = ParseResponses(xhtmlFile);
			for(const auto &response : responses) {
				const auto responseSha2 = Sha2(response);
				if(!listHasValue(responsesOrNull.hashes, responseSha2)) {
					responsesOrNull.hashes.insert(responseSha2);
					questionsOrNull.bytecodes.push_back(question); /* one `question` per `response`, so both lists have equal sizes */
					responsesOrNull.bytecodes.push_back(response);
				}
			}
		}
	}
	auto urls = ParseUrls(xhtmlFile);
	for(auto url : urls) {
		if(!listHasValue(questionsOrNull.signatures, url) && !listHasValue(noRobots, url)) {
			execvex("wget '" + url + "' -O" + xhtmlFile);
			questionsOrNull.signatures.push_back(url);
			questionsResponsesFromXhtml(questionsOrNull, responsesOrNull, xhtmlFile);
		}
	}
}
#ifdef BOOST_VERSION
#include <boost/foreach.hpp> /* BOOST_FOREACH */
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/xml_parser.hpp>
#endif /* BOOST_VERSION */
const std::vector<FilePath> ParseUrls(const FilePath &xhtmlFile) {
	std::vector<FilePath> urls; /* not `const`, since `push_back()` alters this */
#ifdef BOOST_VERSION
	boost::property_tree::ptree pt;
	boost::property_tree::read_xml(xhtmlFile, pt);
	BOOST_FOREACH(
			boost::property_tree::ptree::value_type &v,
			pt.get_child("html.a href"))
		urls.push_back(v.second.data());
#else /* else !BOOST_VERSION */
#endif /* else !BOOST_VERSION */
	return urls;
}
const FileBytecode ParseQuestion(const FilePath &xhtmlFile) { return FileBytecode(); } /* TODO; returns empty (no question) */
const std::vector<FileBytecode> ParseResponses(const FilePath &xhtmlFile) { return std::vector<FileBytecode>(); } /* TODO; returns empty (no responses) */

const std::string assistantCnsProcess(const Cns &cns, const FileBytecode &bytecode) {
	return cns.processToString(bytecode);
}

void assistantCnsLoopProcess(const Cns &cns) {
	std::string bytecode, previous;
	size_t nthResponse = 0;
	while(std::cin >> bytecode) {
		const std::vector<std::string> responses = explodeToList(cns.processToString(bytecode), std::string("<delimiterSeparatesMultiplePossibleResponses>"));
		if(bytecode == previous && responses.size() > 1 + nthResponse) {
			++nthResponse; /* Similar to "suggestions" for next questions, but just uses previous question to give new responses */
		} else {
			nthResponse = 0;
		}
		std::cout << responses.at(nthResponse); /* prints once per input (both branches used to print, so `IGNORE_PAST_MESSAGES` printed twice) */
		previous = bytecode;
#ifdef IGNORE_PAST_MESSAGES
		bytecode = ""; /* reset inputs */
#else
		bytecode += '\n'; /* delimiter separates (and uses) multiple inputs */
#endif /* IGNORE_PAST_MESSAGES */
	}
}
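
The loop above relies on `explodeToList()`, which this listing does not show. A minimal sketch of what such a function could do (assumption: splits the text at each delimiter, as `explode()` does for PHP); `explodeToListSketch` is a hypothetical name, not the repo's implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

/* Hypothetical sketch of `explodeToList`: splits `text` at each `delimiter`. */
inline std::vector<std::string> explodeToListSketch(const std::string &text, const std::string &delimiter) {
	assert(!delimiter.empty()); /* an empty delimiter would loop forever */
	std::vector<std::string> list;
	std::string::size_type start = 0;
	for(std::string::size_type found = text.find(delimiter, start); std::string::npos != found; found = text.find(delimiter, start)) {
		list.push_back(text.substr(start, found - start));
		start = found + delimiter.size();
	}
	list.push_back(text.substr(start)); /* last (or sole) response */
	return list;
}
```

With the test inputs from `assistantCnsTestsThrows()`, the response to "2^16" would split into 2 possible responses ("65536" and "65,536").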

========

Hash resources:
A hash is just a checksum (such as SHA-2) of each sample input, which maps to "this passes" (or "this does not pass").
https://wikipedia.org/wiki/Sha-2
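
A minimal sketch of such hash lists. Assumption: `std::hash` stands in for SHA-2 so that the demo has no dependencies (`std::hash` is not cryptographic, do not use for real lists). `HashListsSketch`, `hashSketch`, `addPassSample` and `hashAnalysisSketch` are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_set>

using HashSketch = std::size_t; /* stand-in; the post uses SHA-2 digests */

struct HashListsSketch {
	std::unordered_set<HashSketch> pass, abort;
};

enum class HashVerdict { Pass, Abort, Unknown };

inline HashSketch hashSketch(const std::string &bytecode) {
	return std::hash<std::string>{}(bytecode); /* not cryptographic; demo only */
}

inline void addPassSample(HashListsSketch &lists, const std::string &bytecode) {
	const HashSketch hash = hashSketch(bytecode);
	assert(0 == lists.abort.count(hash)); /* a sample can not both pass and abort */
	lists.pass.insert(hash);
}

/* Average cost is constant: 1 hash of the file plus (at most) 2 set lookups. */
inline HashVerdict hashAnalysisSketch(const HashListsSketch &lists, const std::string &bytecode) {
	const HashSketch hash = hashSketch(bytecode);
	if(lists.abort.count(hash)) {
		return HashVerdict::Abort;
	}
	if(lists.pass.count(hash)) {
		return HashVerdict::Pass;
	}
	return HashVerdict::Unknown; /* new file; other analyses must decide */
}
```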

Signature resources:
A signature is just a substring (or regex) of infections, which the virus analysis tool checks all executables for; if the signature is found in the executable, do not allow the launch, otherwise allow the launch.
https://wikipedia.org/wiki/Regex
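
A minimal sketch of the substring form of this check. `signatureAnalysisSketch` is a hypothetical name (the repo's `signatureAnalysis()` has a different interface), and the signatures used are made up.

```cpp
#include <cassert>
#include <string>
#include <vector>

/* Returns true if no signature occurs in `bytecode` (so the launch can continue). */
inline bool signatureAnalysisSketch(const std::string &bytecode, const std::vector<std::string> &signatures) {
	for(const auto &signature : signatures) {
		assert(!signature.empty()); /* an empty signature would match all files */
		if(std::string::npos != bytecode.find(signature)) {
			return false; /* found a signature of infection */
		}
	}
	return true;
}
```

Cost grows with the number of signatures times the size of the file, which is one reason to keep signatures few and small.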

Static analysis resources:
https://github.com/topics/analysis has lots of open source (FLOSS) analysis tools (such as
https://github.com/kylefarris/clamscan,
which wraps https://github.com/Cisco-Talos/clamav/ ,)
which show how to use hex dumps (or disassembled sources) of the apps/SW (executables) to deduce what the apps/SW do to your OS.
Static analysis (such as Clang/LLVM has) just checks programs for accidental security threats (such as buffer overruns/underruns, or null-pointer-dereferences,) but could act as a basis,
if you add a few extra checks for deliberate vulnerabilities/signs of infection (these are heuristics, so the user should have a choice to quarantine and submit for review, or continue launch of this).
https://github.com/llvm/llvm-project/blob/main/clang/lib/StaticAnalyzer
is part of Clang/LLVM (license is FLOSS), and does static analysis (symbolic execution produces inputs to functions; formulas analyze the execution paths (+ heap/stack uses) to produce lists of possible unwanted side effects to warn you of). Versus -fsanitize, static analysis does not require that you run the program: -fsanitize requires you to produce inputs, static analysis does this for you.
LLVM is lots of files; Phasar is a smaller (LLVM-based) static analysis framework:
https://github.com/secure-software-engineering/phasar

Example outputs (tests “Fdroid.apk”) from VirusTotal, of static analysis + 2 sandboxes;
the false positive outputs (from VirusTotal's Zenbox) show the purpose of manual review.

Sandbox resources:
As opposed to static analysis of the executable's hex (or disassembled sources,)
sandboxes perform chroot + functional analysis.
https://wikipedia.org/wiki/Valgrind is just meant to locate accidental security vulnerabilities, but is a common example of functional analysis.
On Unix-like systems (such as Linux), tools can use:
chroot (run man chroot for instructions) so that the programs you test cannot alter stuff out of the test;
plus can use strace (run man strace for instructions, or look at https://opensource.com/article/19/10/strace
https://www.geeksforgeeks.org/strace-command-in-linux-with-examples/ ) which hooks all system calls and saves logs for functional analysis.
Simple sandboxes just launch programs with chroot + strace for a few seconds,
with all outputs sent for manual reviews;
if more complex, has heuristics to guess what is important (in case of lots of submissions, so manual reviews have less to do.)
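
A minimal sketch of how a simple sandbox could produce its command line (to pass to something such as `execvex()`). `sandboxCommandSketch` is a hypothetical helper; it just produces the string, does not run it. Assumptions: GNU `timeout` limits the run to a few seconds, the paths have no `'` characters, `program` is a path within the new root, and the caller has the privileges which `chroot` requires.

```cpp
#include <cassert>
#include <string>

/* Produces: timeout <seconds> strace -f -o '<logFile>' chroot '<chrootDir>' '<program>'
 * `strace -f` follows child processes, `-o` saves the system call log to a file. */
inline std::string sandboxCommandSketch(const std::string &chrootDir, const std::string &program, const std::string &logFile, unsigned seconds) {
	assert(!chrootDir.empty() && !program.empty() && !logFile.empty());
	return "timeout " + std::to_string(seconds)
		+ " strace -f -o '" + logFile + "'"
		+ " chroot '" + chrootDir + "'"
		+ " '" + program + "'";
}
```

The log file is what goes to manual review (or to heuristics which score the log).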

Autonomous sandboxes (such as VirusTotal's) use full outputs from all analyses,
with formulas to guess whether the app/SW is safe
(thousands of rules such as "Should not alter files of other programs unless prompted to through OS dialogs", "Should not perform network access unless prompted to from you", "Should not perform actions leading to obfuscation which could hinder analysis"),
which, if violated, add to the executable's "danger score" (which the analysis results page shows you).
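
A minimal sketch of such a "danger score" over the lines of a strace log. The rules and weights are made up for the demo (real sandboxes have thousands of rules, plus context such as "was this prompted by the user?"); `DangerRuleSketch` and `dangerScoreSketch` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct DangerRuleSketch {
	std::string pattern; /* substring to look for in one strace log line */
	unsigned score;      /* hypothetical weight, added once per line which matches */
};

/* Sums the weights of all rules which match each log line. */
inline unsigned dangerScoreSketch(const std::vector<std::string> &straceLines, const std::vector<DangerRuleSketch> &rules) {
	unsigned total = 0;
	for(const auto &line : straceLines) {
		for(const auto &rule : rules) {
			assert(!rule.pattern.empty());
			if(std::string::npos != line.find(rule.pattern)) {
				total += rule.score;
			}
		}
	}
	return total;
}
```

A threshold on the total could choose between "continue", "submit for review" and "block".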

CNS resources:
Once the virus analysis tool has static+functional analysis (+ sandbox,) the next logical move is to do artificial CNS.
Just as (if humans grew trillions of neurons plus thousands of layers of cortices) one of us could parse all databases of infections (plus samples of fresh apps/SW) to setup our synapses to parse hex dumps of apps/SW (to allow us to revert all infections to fresh apps/SW, or if the whole thing is an infection just block,)
so too could artificial CNS (with trillions of artificial neurons) do this:
For analysis, pass training inputs mapped to outputs (infection -> block, fresh apps/SW -> pass) to artificial CNS;
To undo infections (to restore to fresh apps/SW,)
inputs = samples of all (infections or fresh apps/SW,)
outputs = EOF/null (if is infection that can not revert to fresh apps/SW,) or else outputs = fresh apps/SW;
To setup synapses, must have access to huge sample databases (such as Virustotal's access.)

GitHub has lots of FLOSS (Free/Libre Open Source Software) simulators of CNS at https://github.com/topics/artificial-neural-network such as:

"HSOM" (license is FLOSS) has simple Python artificial neural networks: https://github.com/CarsonScott/HSOM

"apxr_run" (https://github.com/Rober-t/apxr_run/ , license is FLOSS) is more complex;
"apxr_run" has various FLOSS neural network activation functions (absolute, average, standard deviation, sqrt, sin, tanh, log, sigmoid, cos), plus sensor functions (vector difference, quadratic, multiquadric, saturation [+D-zone], gaussian, cartesian/planar/polar distances): https://github.com/Rober-t/apxr_run/blob/master/src/lib/functions.erl
Various FLOSS neuroplastic functions (self-modulation, Hebbian function, Oja's function): https://github.com/Rober-t/apxr_run/blob/master/src/lib/plasticity.erl
Various FLOSS neural network input aggregator functions (dot products, product of differences, mult products): https://github.com/Rober-t/apxr_run/blob/master/src/agent_mgr/signal_aggregator.erl
Various simulated-annealing functions for artificial neural networks (dynamic [+ random], active [+ random], current [+ random], all [+ random]): https://github.com/Rober-t/apxr_run/blob/master/src/lib/tuning_selection.erl
Choices to evolve connections through Darwinian or Lamarckian formulas: https://github.com/Rober-t/apxr_run/blob/master/src/agent_mgr/neuron.erl
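
A minimal C++ sketch of 3 of the pieces which the list above names: an input aggregator (dot product), an activation function (sigmoid), and a neuroplastic function (Oja's rule). These are the textbook formulas, not translations of the apxr_run sources; all names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

/* Aggregator: sum of each input times its synaptic weight. */
inline double dotProductSketch(const std::vector<double> &inputs, const std::vector<double> &weights) {
	assert(inputs.size() == weights.size());
	double sum = 0.0;
	for(std::size_t x = 0; inputs.size() > x; ++x) {
		sum += inputs[x] * weights[x];
	}
	return sum;
}

/* Activation: squashes the aggregate into the range (0, 1). */
inline double sigmoidSketch(double x) {
	return 1.0 / (1.0 + std::exp(-x));
}

/* Oja's rule: w += rate * y * (x - y * w), where y is the linear output.
 * This is Hebbian growth plus a decay term which keeps the weights bounded. */
inline void ojaUpdateSketch(std::vector<double> &weights, const std::vector<double> &inputs, double rate) {
	const double output = dotProductSketch(inputs, weights);
	for(std::size_t x = 0; weights.size() > x; ++x) {
		weights[x] += rate * output * (inputs[x] - output * weights[x]);
	}
}
```

One neuron's output is then `sigmoidSketch(dotProductSketch(inputs, weights))`; layers just repeat this for each neuron.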

Simple to convert Erlang functions to Java/C++ (to reuse for fast programs);
the syntax is close to Prolog's.

Examples of howto setup APXR as artificial CNS; https://github.com/Rober-t/apxr_run/blob/master/src/examples/
Examples of howto setup HSOM as artificial CNS; https://github.com/CarsonScott/HSOM/tree/master/examples
Simple to setup once you have access to databases.

Alternative CNS:
https://swudususuwu.substack.com/p/albatross-performs-lots-of-neural

This post was about general methods to produce virus analysis tools,
does not require that local resources do all of this;
For systems with lots of resources, could have local sandboxes/CNS;
For systems with less resources, could just submit samples of unknown apps/SW to hosts to perform analysis;
Could have small local sandboxes (that just run for a few seconds) and small CNS (just billions of neurons with hundreds of layers,
versus the trillions of neurons with thousands of layers of cortices that antivirus hosts would use for this);
Allows reuses of workflows the analysis tool has (could just add (small) local sandboxes, or just add artificial CNS to antivirus hosts for extra analysis.)

How to reproduce the problem

Scan new executables (that are not part of stock databases)

@ETERNALBLUEbullrun ETERNALBLUEbullrun changed the title To produce better virus scanners to secure us, could have training data = inputs of all infected files/programs (such as samples from Virustotal), where as outputs = fresh programs (or "Null" if no fresh programs), to produce artificial CNS to undo infections from files/programs To produce better virus scanners to secure us, could have training data = inputs of all infected files/programs (such as samples from Virustotal), where as outputs = fresh programs (or "Null" if no fresh programs to return to), to produce artificial CNS to undo infections from files/programs Mar 19, 2024

@Kangie
Contributor

Kangie commented Mar 21, 2024

Thanks for the... interesting suggestion.

This approach does not seem workable for a number of reasons, not the least of which is the apparent lack of a coherent suggestion and workable implementation plan. Since you're obviously a fan of "AI" I've asked Gemini to assist in drafting the remainder of my response:

Resource Challenges:

  • Building and maintaining these networks requires significant resources, especially for data collection and training. Keeping up with the ever-evolving threat landscape would be a constant battle.

False Positive Issues:

  • Novel threats could easily trip up these systems, leading to a flood of false positives and wasted resources.

Current Methods Work Well:

  • Established approaches like signature-based detection and heuristics are effective for most threats. ClamScan utilizes these methods successfully.

Alternative Solutions:

  • While ANNs are a promising research area for future antivirus development, there are more practical solutions available for now. If you're concerned about a specific file, you can always report it to a reputable antivirus vendor for analysis. They have the expertise and resources to investigate suspicious files thoroughly.

@ETERNALBLUEbullrun
Author

ETERNALBLUEbullrun commented Mar 21, 2024

Do not trust AI; AI is just sin, is not an artificial CNS.

Resources: This post suggests to produce artificial CNS, and shows you FLOSS resources of artificial CNS (such as APXR and HSOM) that have examples of how to setup for us.

This post also suggests uses of heuristical analysis plus sandboxes, and links to resources (such as Virustotal/Zenbox) that do so for us.

Current methods: Other researchers would not have begun to produce new methods if the old methods are good enough for us.
The old methods are to compile databases of signatures of infection (small samples of bytecode/hex,) to search for files with infections and quarantine/undo such from us,
which is not workable for self-modifying-code/"polymorphic viruses."

How this affects us: Safety concerns are the main reason that autonomous robots do not work outdoors to mass produce structures such as houses to us.
To remove the threat of infections from such tools, must use heuristical analysis, sandboxes plus artificial CNS.
Controlled lab settings show that (versus humans) vehicles with autonomous OS reduce risks of crashes,
so the only reason that all vehicles are not autonomous,
-- and that all work is not autonomous --
is because of the threat of infections, which new methods for virus scanners could undo from us.
Because humans can not produce enough food and houses for us.
most of us are starving to death and/or homeless, unable to afford food/houses,
thus the importance of reliable autonomous tools to mass produce food/houses to us

@Kangie
Contributor

Kangie commented Mar 21, 2024

Do not trust AI; AI is just sin, is not an artificial CNS.

Resources: This post suggests to produce artificial CNS, and shows you FLOSS resources of artificial CNS (such as APXR and HSOM) that have examples of how to setup for us.

It's clear that you don't have the depth to engage on this topic.

Artificial Neural Networks (ANNs) aren't exactly the same as a human brain (CNS). However, ANNs are inspired by the structure and function of the brain and fall under the broad umbrella of Artificial Intelligence (AI). AI encompasses various approaches to mimicking human intelligence, and ANNs are one specific technique.

This post also suggests uses of heuristical analysis plus sandboxes, and links to resources (such as Virustotal/Zenbox) that do so for us.

You know what already uses heuristics? ClamAV! https://blog.clamav.net/2011/03/top-5-misconceptions-about-clamav.html

I'll also note quickly that the blog post also indicates that the ClamAV team use sandboxes, though perhaps not in the automated way that you're envisioning (some sort of honeypot perhaps?)

Current methods: Other researchers would not have begun to produce new methods if the old methods are good enough for us. The old methods are to compile databases of signatures of infection, to undo the infection for us, which is not workable for new polymorphic viruses.

It is clear that you do not understand how antiviruses and endpoint protection services work. It is uncommon to 'undo the infection' (i.e. clean infected files), instead these tools focus on preventing the exploitation of a device by preventing the execution of "bad" code on an endpoint (and detecting and quarantining infected files).

How this affects us: Safety concerns are the main reason that autonomous robots do not work outdoors to mass produce structures such as houses to us. To remove the threat of infections from such tools, must use heuristical analysis, sandboxes plus artificial CNS. Controlled lab settings show that (versus humans) vehicles with autonomous OS reduce risks of crashes, so the only reason that all vehicles are not autonomous, -- and that all work is not autonomous -- is because of the threat of infections, which new methods for virus scanners could undo from us.

[citation needed]

@ETERNALBLUEbullrun
Author

ETERNALBLUEbullrun commented Mar 21, 2024

Gemini is not able to follow links or parse sources.
APXR is not an exact clone of human's CNS, but advances past human's CNS (original post now has stuff about apxr_run)

Lots of antiviruses are able to undo infection from programs,
for cases of infections that spread to normal programs.
If the whole program itself is an infection, you should undo it from us.
For years, lots of virus scanners could undo simple infections from programs,
(such as infections that just add a few blocks of code to the end of the file and patch the entry point to run the infection at the end before jumping back to the front and resuming the normal program, which are the most simple to undo from normal programs.)
But CNS virus scanners could undo much more advanced/complex infections from programs,
and restore the normal programs back to us,
because an artificial CNS is capable of all that a human CNS is,
but with more neurons and layers of cortices,
and the virus scanner CNS would devote all neurons to processes to parse hex dumps of programs and setup synapses to recover programs (or undo if the whole file is an infection with no uses.)

Was stupid to not have found those pages about how ClamAV/ClamScan uses some heuristical analysis,
you have done good to us with this. Oops.
But as "AI"/artificial CNS becomes more common,
is important for virus scanners to use such tools to secure us.
Humans can not react as fast.

@micahsnyder
Contributor

But as "AI"/artificial CNS becomes more common,
is important for virus scanners to use such tools to secure us.
Humans can not react as fast.

I agree with the sentiment of your request. It is a good request to investigate AI / ML to identify malware.

Just last week, the Snort team released SnortML, which is a module for Snort that may load ML models to classify HTTP URI inputs to identify zero day attacks: https://blog.snort.org/2024/03/talos-launching-new-machine-learning.html It would be wonderful to add detection capabilities to ClamAV. It seems like a promising research area for folks interested in malware research.

@ETERNALBLUEbullrun ETERNALBLUEbullrun changed the title To produce better virus scanners to secure us, could have training data = inputs of all infected files/programs (such as samples from Virustotal), where as outputs = fresh programs (or "Null" if no fresh programs to return to), to produce artificial CNS to undo infections from files/programs Virus analysis tools should use local heuristical analysis/sandboxes plus artificial CNS Apr 23, 2024
@ETERNALBLUEbullrun
Author

Updated original post (English fixes, + extra examples/sources)

@micahsnyder
Contributor

This is too large of a request. If you want to make such a thing, we could possibly accept a pull request with this kind of feature added. It is also probably too resource intensive to run on the devices that ClamAV uses.
Another strategy is to make AI/ML models and run them in the backend to generate signatures that are static.
In any case, since this is so far from what we do, and since we don't have the resources to work on it, I am closing this request.

@ETERNALBLUEbullrun
Author

ETERNALBLUEbullrun commented Apr 25, 2024

It is also probably too resource intensive to run on the devices that ClamAV uses.

Is fast with caches.
Introduced pseudocodes to do static analysis + sandbox + CNS.
What's left is the specifics (what patterns/functions should static analysis flag for review? what outputs from strace should flag for review? which artificial CNS is best for this, how many layers to use, how many neurons to use, what activation functions are best for this?)
If you do not care about the specifics, could just use the most simple to implement and submit a pull request.
But want to know what requirements you have to accept this.

To train (produce synaptic weights for) the CNS, is slow plus requires access to huge sample databases,
but the synaptic weights use small resources, plus allow the client to do fast analysis.

@ETERNALBLUEbullrun
Author

ETERNALBLUEbullrun commented Apr 29, 2024

Another strategy is to make AI/ML models and run them in the backend to generate signatures that are static.

Artificial central nervous system's backpropagation/forwardpropagation (massive parallelization) is not suitable to do lossless formulas to compress (to produce signatures has lots of tight loops, close to how you produce codebooks for formulas such as Bzip2).
Original post now has fast (versus manual creation of signatures) functional approach to produce signatures; produceAbortListSignatures(), which uses listProduceUniqueSubstr(), which uses loops + listHasSubstr(). This produces signatures = the smallest substr unique to files with infection (substr does not appear in fresh SW).
To identify which file has infection, original post now has functions to do static analysis + autonomous sandbox + artificial CNS.
To produce the signatures is slow, the sandbox is slow, to produce the CNS is slow.
The signatures produced are small, the client can use the signatures fast.
The client can use the CNS fast.
The static analysis is fast.
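
A minimal sketch of the "smallest substr unique to files with infection" step, as brute force (tests all substrings from shortest to longest). This is not the repo's `listProduceUniqueSubstr()`; `listHasSubstrSketch` and `uniqueSubstrSketch` are hypothetical names, and a real tool would want a faster structure (such as a suffix automaton) for huge sample lists.

```cpp
#include <cassert>
#include <string>
#include <vector>

/* Returns true if `substr` occurs in some value of `list`. */
inline bool listHasSubstrSketch(const std::vector<std::string> &list, const std::string &substr) {
	assert(!substr.empty());
	for(const auto &value : list) {
		if(std::string::npos != value.find(substr)) {
			return true;
		}
	}
	return false;
}

/* Returns the shortest substring of `infected` which no file in `passList` has
 * (the leftmost, if there are ties), or "" if all substrings occur in `passList`. */
inline std::string uniqueSubstrSketch(const std::vector<std::string> &passList, const std::string &infected) {
	for(std::string::size_type length = 1; infected.size() >= length; ++length) {
		for(std::string::size_type start = 0; infected.size() >= start + length; ++start) {
			const std::string candidate = infected.substr(start, length);
			if(!listHasSubstrSketch(passList, candidate)) {
				return candidate; /* shortest lengths are tested first */
			}
		}
	}
	return std::string();
}
```

The loops are slow (which matches "to produce the signatures is slow"), but the signature which results is small, so the client's scan is fast.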

@ETERNALBLUEbullrun
Author

ETERNALBLUEbullrun commented May 30, 2024

Original post was pseudocode, is now C++.
If submit a pull request, would base off of this.
Is this good enough for you?

@ETERNALBLUEbullrun
Author

Original post has new fixes. Comments have new fixes.

@micahsnyder
Contributor

@ETERNALBLUEbullrun The concepts you're discussing is so much outside my wheelhouse it mostly sounds like ChatGPT make up some tech jargon.

The code you shared isn't what I would call C++. It's just C++ wrapping around Python code.

Sorry, we're not interested.

@ETERNALBLUEbullrun
Author

ETERNALBLUEbullrun commented Jun 17, 2024

The code you shared isn't what I would call C++. It's just C++ wrapping around Python code.

class Cns is "TODO"/"work-in-progress".
Have removed the tentative HSOM (which is a Python lib) implementation of class Cns from original post (it was not a significant part of this issue).

SwuduSusuwu/SubStack#6 "HSOM (Python) / apxr_run (Erlang) too difficult to include; produce C++ artificial central nervous sys
...
Lots of FLOSS C++ neural networks to use as to implement class Cns interfaces, such as:
https://github.com/yixuan/MiniDNN
https://github.com/gantoreno/iris "

Was that the sole concern? With C++ implementation of class Cns, Cisco-Talos accepts this?

@ETERNALBLUEbullrun
Author

ETERNALBLUEbullrun commented Jun 17, 2024

The concepts you're discussing is so much outside my wheelhouse it mostly sounds like ChatGPT make up some tech jargon.

Last post before this ( #1206 (comment) ) was about how to produce virus signatures (which is just one submodule of this issue). Is this what you are referring to?

Am curious: what can you ask ChatGPT which has a chance to produce this? Which part confused you?

Was it the part about how formulas to compress data (lossless) with codebooks, are close to formulas to produce virus signatures? Formulas such as bzip2 use tight loops to produce codebooks (not actual books, just lists of unique substrings) so that the compressed file includes each substring just once. This was a response to the suggestion to use artificial intelligence (which is lossy) to produce the signature lists.

clang++ / g++ can compile static libs from the sources (git clone https://github.com/SwuduSusuwu/SubStack.git && cd ./SubStack/ && ./make && find ./ -name '*.o').

produceAbortListSignatures(const ResultList &passList, ResultList &abortList) is finished (produces smallest possible virus signature lists).

This is not a concept, runnable C++ source exists.

@ETERNALBLUEbullrun
Author

ETERNALBLUEbullrun commented Jun 19, 2024

g++ -c cxx/ClassSha2.cxx gives ClassSha2.o
g++ -c cxx/ClassResultList.cxx gives ClassResultList.o
g++ -c cxx/VirusAnalysis.cxx gives VirusAnalysis.o
Usage;

#include "cxx/VirusAnalysis.hxx"
const bool produceSignatures() {
	abortList.bytecodes = ...  /* Infested-files */;
	passList.bytecodes = ... /* Files which pass */;
	if(produceAbortListSignatures(passList, abortList)) {
		/* dump abortList.signatures to disk */
		return true;
	}
	return false;
}
const bool passesAnalysis(const PortableExecutable &executable) {
	return signatureAnalysis(executable, Sha2(executable.bytecode));
}

PortableExecutable with signatureAnalysis does not have differences for Portable Executables (Microsoft) versus ELF files (Linux/Unix).

@ETERNALBLUEbullrun
Author

ETERNALBLUEbullrun commented Jun 19, 2024

Was the confusion from the original post's For comparison; produceVirusFixCns is close to assistants (such as "ChatGPT 4.0" or "Claude-3 Opus",) have such demo as produceAssistantCns;?
This meant that produceAssistantCns is an alternative to such assistants, not that such assistants produced this.
The purpose of this text was that, due to how complex produceVirusFixCns is, to have comparisons to tools (such as those bots) which exist.
Those tools can detect simple problems in text (such as typos,) plus produce fixes. produceVirusFixCns produces a Cns which can detect simple infections in executables, plus produce fixes.
